# Netflix and IMDb Data Analysis Project – Data Analysis

After data cleaning, we store the data in SQLite to make the analysis more efficient. This allows us to run SQL queries directly for data exploration and statistical analysis.

In [22]:
import sqlite3
import pandas as pd

conn = sqlite3.connect('/Users/mimi/Documents/imdb-analysis/data/netflix_imdb.db')

df = pd.read_csv('/Users/mimi/Documents/imdb-analysis/data/netflix_imdb_matched.csv', sep='\t')

df.to_sql('netflix_imdb_data_now', conn, if_exists='replace', index=False)

result = pd.read_sql_query("SELECT COUNT(*) as total_movies FROM netflix_imdb_data_now", conn)
print(result)

conn.close()

   total_movies
0          7278
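
Since every later cell opens and closes its own connection, one option is a small helper that does this in a single place. The following is only a sketch: the name `run_query_rows`, the temporary `toy.db` file, and the `toy_titles` table are hypothetical stand-ins for the project database, and the helper returns plain tuples rather than a DataFrame.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def run_query_rows(db_path, sql, params=()):
    """Run a query against db_path and return (column_names, rows)."""
    # closing() ensures the connection is closed even if the query raises.
    with closing(sqlite3.connect(db_path)) as conn:
        cursor = conn.execute(sql, params)
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()


# Hypothetical toy database standing in for netflix_imdb.db.
db_path = os.path.join(tempfile.mkdtemp(), "toy.db")
with closing(sqlite3.connect(db_path)) as conn:
    conn.execute("CREATE TABLE toy_titles (title TEXT, imdb_rating REAL)")
    conn.executemany(
        "INSERT INTO toy_titles VALUES (?, ?)",
        [("Title A", 7.5), ("Title B", 6.0)],
    )
    conn.commit()

columns, rows = run_query_rows(
    db_path, "SELECT COUNT(*) AS total_movies FROM toy_titles"
)
print(columns, rows)
```

Passing the database path as an argument keeps the absolute path out of each query cell.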


### 1. Relationship Between IMDb Rating and Number of Votes

In [23]:
conn = sqlite3.connect('/Users/mimi/Documents/imdb-analysis/data/netflix_imdb.db')

votes_query = pd.read_sql_query("""
SELECT 
    title,
    release_year,
    imdb_rating, 
    imdb_votes
FROM 
    netflix_imdb_data_now
WHERE 
    imdb_votes IS NOT NULL
ORDER BY 
    imdb_votes DESC
LIMIT 10;
""", conn)

print(votes_query)

conn.close()

                                           title  release_year  imdb_rating  \
0                                      Inception        2010.0          8.8   
1                                   Pulp Fiction        1994.0          8.9   
2                                     The Matrix        1999.0          8.7   
3  The Lord of the Rings: The Return of the King        2003.0          9.0   
4          The Lord of the Rings: The Two Towers        2002.0          8.8   
5                               Django Unchained        2012.0          8.5   
6                           Inglourious Basterds        2009.0          8.4   
7                                 Shutter Island        2010.0          8.2   
8                               Schindler's List        1993.0          9.0   
9                                   The Departed        2006.0          8.5   

   imdb_votes  
0     2641978  
1     2297137  
2     2124883  
3     2049730  
4     1847887  
5     1759561  
6     1650129  
7 

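This analysis explores the relationship between movie ratings and popularity. The ten most-voted titles all have IMDb ratings of 8.2 or higher, which suggests that widely watched films in this dataset also tend to be highly rated. A top-10 list only covers the extreme end of the distribution, however, so it does not by itself show that higher-rated movies generally receive more votes.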
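
To look beyond the ten most-voted titles, one possible follow-up is to group titles into whole-number rating bands and compare the average vote count per band. The sketch below uses a hypothetical in-memory table `toy_titles` with made-up rows, not the project data; only the column names `imdb_rating` and `imdb_votes` are taken from the notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_titles (title TEXT, imdb_rating REAL, imdb_votes INTEGER)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO toy_titles VALUES (?, ?, ?)",
    [
        ("A", 5.2, 1000),
        ("B", 5.9, 2000),
        ("C", 7.1, 50000),
        ("D", 7.8, 30000),
        ("E", 8.5, 900000),
    ],
)

band_sql = """
SELECT
    CAST(imdb_rating AS INTEGER) AS rating_band,  -- 7.8 -> 7 (truncation)
    COUNT(*) AS num_titles,
    AVG(imdb_votes) AS avg_votes
FROM toy_titles
WHERE imdb_rating IS NOT NULL
  AND imdb_votes IS NOT NULL
GROUP BY rating_band
ORDER BY rating_band;
"""
band_rows = conn.execute(band_sql).fetchall()
conn.close()

for rating_band, num_titles, avg_votes in band_rows:
    print(rating_band, num_titles, avg_votes)
```

If average votes rise from band to band on the real table, that would support the claim more directly than the top-10 list does.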

### 2. Decade-wise IMDb Rating Analysis

In [24]:
def run_query(sql):
    conn = sqlite3.connect('/Users/mimi/Documents/imdb-analysis/data/netflix_imdb.db')
    result = pd.read_sql_query(sql, conn)
    conn.close()
    return result

sql = """
SELECT 
    CASE 
        WHEN release_year BETWEEN 1920 AND 1929 THEN '1920s'
        WHEN release_year BETWEEN 1930 AND 1939 THEN '1930s'
        WHEN release_year BETWEEN 1940 AND 1949 THEN '1940s'
        WHEN release_year BETWEEN 1950 AND 1959 THEN '1950s'
        WHEN release_year BETWEEN 1960 AND 1969 THEN '1960s'
        WHEN release_year BETWEEN 1970 AND 1979 THEN '1970s'
        WHEN release_year BETWEEN 1980 AND 1989 THEN '1980s'
        WHEN release_year BETWEEN 1990 AND 1999 THEN '1990s'
        WHEN release_year BETWEEN 2000 AND 2009 THEN '2000s'
        WHEN release_year BETWEEN 2010 AND 2019 THEN '2010s'
        WHEN release_year BETWEEN 2020 AND 2029 THEN '2020s'
        ELSE 'Other' 
    END AS decade,
    AVG(imdb_rating) AS avg_rating,
    COUNT(*) AS num_movies
FROM 
    netflix_imdb_data_now
WHERE 
    release_year IS NOT NULL
GROUP BY 
    decade
ORDER BY 
    decade;



"""

result = run_query(sql)
print(result)

  decade  avg_rating  num_movies
0  1940s    6.316667           6
1  1950s    7.118182          11
2  1960s    7.313043          23
3  1970s    7.032203          59
4  1980s    6.799074         108
5  1990s    6.568619         239
6  2000s    6.546554         769
7  2010s    6.460171        4798
8  2020s    6.460711        1265


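Average IMDb ratings peak in the 1960s (7.31) and 1950s (7.12) and then decline steadily, from 7.03 in the 1970s to 6.46 in the 2010s and 2020s. The 1940s actually have the lowest average (6.32), but that figure rests on only 6 titles and should be read with caution. Sample sizes differ sharply between decades in general: 11 titles from the 1950s versus 4,798 from the 2010s.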
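
The decade label can also be derived arithmetically instead of listing every decade in a `CASE` expression. This is a sketch on a hypothetical in-memory table `toy_titles` with made-up rows; it assumes `release_year` holds whole-number years stored as floats, as the printed output suggests. The parentheses matter because `||` binds more tightly than `*` and `/` in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_titles (release_year REAL, imdb_rating REAL)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO toy_titles VALUES (?, ?)",
    [
        (1942.0, 6.0),
        (1994.0, 8.0),
        (1999.0, 7.0),
        (2010.0, 6.0),
        (2015.0, 8.0),
        (2018.0, 7.0),
    ],
)

decade_sql = """
SELECT
    -- Integer division drops the last digit: 1994 / 10 * 10 = 1990.
    ((CAST(release_year AS INTEGER) / 10) * 10) || 's' AS decade,
    ROUND(AVG(imdb_rating), 2) AS avg_rating,
    COUNT(*) AS num_movies
FROM toy_titles
WHERE release_year IS NOT NULL
GROUP BY decade
ORDER BY decade;
"""
decade_rows = conn.execute(decade_sql).fetchall()
conn.close()

for row in decade_rows:
    print(row)
```

Keeping `num_movies` next to the average makes thinly populated decades easy to spot.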

### 3. Best Directors Based on IMDb Ratings

In [25]:
def run_query(sql):
    conn = sqlite3.connect('/Users/mimi/Documents/imdb-analysis/data/netflix_imdb.db')
    result = pd.read_sql_query(sql, conn)
    conn.close()
    return result

sql = """

SELECT 
    director,
    AVG(imdb_rating) AS avg_rating,
    COUNT(*) AS num_movies
FROM 
    netflix_imdb_data_now
WHERE 
    director IS NOT NULL
GROUP BY 
    director
HAVING 
    COUNT(*) >= 3
ORDER BY 
    avg_rating DESC
LIMIT 15;

"""

result = run_query(sql)
print(result)

                       director  avg_rating  num_movies
0           Michel Hazanavicius    8.266667           3
1               Joel Schumacher    8.266667           3
2             Quentin Tarantino    8.237500           8
3          Hrishikesh Mukherjee    8.125000           4
4                 Barak Goodman    8.100000           3
5                   Oriol Paulo    8.080000           5
6                Anurag Kashyap    8.066667           3
7                   Stan Lathan    8.066667           3
8                  Dean DeBlois    8.066667           3
9   Daniel Lindsay, T.J. Martin    8.066667           3
10              Rajkumar Hirani    8.033333           3
11            Christopher Nolan    8.033333           3
12            Stefano Lodovichi    8.000000           3
13                Peter Jackson    7.942857           7
14                  David Batty    7.925000           4


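This table lists the 15 directors with the highest average IMDb rating, restricted to directors with at least three titles in the dataset. Most entries sit exactly at that three-title minimum, so their averages are sensitive to a single film; Quentin Tarantino (8 titles) and Peter Jackson (7 titles) are the best-supported entries. Co-directed titles, such as those by Daniel Lindsay and T.J. Martin, are grouped under one combined director value.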
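
Because the ranking depends on the minimum-title cutoff, it can help to make that cutoff a parameter and see how the list changes. The function `top_directors`, the table `toy_titles`, and its rows are hypothetical; the extra `ORDER BY` terms are an added tie-breaker, not part of the original query.

```python
import sqlite3


def top_directors(conn, min_movies=3, limit=15):
    """Return (director, avg_rating, num_movies) for directors with enough titles."""
    sql = """
    SELECT
        director,
        ROUND(AVG(imdb_rating), 2) AS avg_rating,
        COUNT(*) AS num_movies
    FROM toy_titles
    WHERE director IS NOT NULL
      AND imdb_rating IS NOT NULL
    GROUP BY director
    HAVING COUNT(*) >= ?
    ORDER BY avg_rating DESC, num_movies DESC, director
    LIMIT ?
    """
    return conn.execute(sql, (min_movies, limit)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_titles (director TEXT, imdb_rating REAL)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO toy_titles VALUES (?, ?)",
    [
        ("Director X", 9.0),
        ("Director Y", 8.0),
        ("Director Y", 8.0),
        ("Director Z", 7.0),
        ("Director Z", 8.0),
        ("Director Z", 9.0),
    ],
)

loose = top_directors(conn, min_movies=1)
strict = top_directors(conn, min_movies=3)
conn.close()

print(loose)
print(strict)
```

With a cutoff of 1 the single-title director leads the toy ranking; with a cutoff of 3 only the director with three titles remains.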

### 4. High-Rated Movies

In [26]:
def run_query(sql):
    conn = sqlite3.connect('/Users/mimi/Documents/imdb-analysis/data/netflix_imdb.db')
    result = pd.read_sql_query(sql, conn)
    conn.close()
    return result

sql = """

SELECT 
    release_year,
    COUNT(*) FILTER (WHERE imdb_rating >= 8) AS high_rating_movies,
    RANK() OVER (ORDER BY COUNT(*) FILTER (WHERE imdb_rating >= 8) DESC) AS rank
FROM 
    netflix_imdb_data_now
GROUP BY 
    release_year
ORDER BY 
    high_rating_movies DESC
LIMIT 10;



"""

result = run_query(sql)
print(result)

   release_year  high_rating_movies  rank
0        2018.0                  89     1
1        2019.0                  84     2
2        2017.0                  76     3
3        2020.0                  72     4
4        2016.0                  51     5
5        2015.0                  42     6
6        2021.0                  40     7
7        2014.0                  31     8
8        2013.0                  25     9
9        2011.0                  23    10


The data reveals an interesting trend: the years from 2015 to 2021 take the top seven spots for the most high-rated movies (IMDb rating ≥ 8). 2018 leads with 89 high-rated films, followed by 2019 (84) and 2017 (76).

In the previous "Decade-wise IMDb Rating Analysis", we found that the 1960s (7.31) and 1950s (7.12) had the highest average IMDb ratings.

However, when we looked at the Top 10 Years with the Most High-Rated Movies, no year from these decades made the list. I think this could be because of a “volume effect”. The dataset contains only 11 titles from the 1950s and 23 from the 1960s, so these decades cannot compete on raw counts, and the older films that are in the catalogue are often classics that stood the test of time. The 2010s and 2020s contribute far more titles (4,798 and 1,265), especially because of streaming platforms. Even though there are many great movies among them, there are also a lot of average or bad ones, which makes the average rating lower.
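
One way to separate volume from quality is to report the share of high-rated titles per year alongside the raw count. The sketch below runs on a hypothetical in-memory table `toy_titles` with made-up rows; `high_rated_pct` is an illustrative column name, not one from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_titles (release_year INTEGER, imdb_rating REAL)")
# Made-up rows: a small "classic" year and a large recent year.
toy_rows = [(1962, r) for r in (8.5, 8.1, 7.0)]
toy_rows += [(2018, r) for r in (8.2, 8.0, 8.4, 6.0, 5.5, 6.5, 4.0, 7.0, 6.1, 5.9)]
conn.executemany("INSERT INTO toy_titles VALUES (?, ?)", toy_rows)

share_sql = """
SELECT
    release_year,
    COUNT(*) AS num_titles,
    SUM(imdb_rating >= 8) AS high_rated,  -- a comparison yields 1 or 0 in SQLite
    ROUND(100.0 * SUM(imdb_rating >= 8) / COUNT(*), 1) AS high_rated_pct
FROM toy_titles
WHERE release_year IS NOT NULL
  AND imdb_rating IS NOT NULL
GROUP BY release_year
ORDER BY high_rated_pct DESC;
"""
share_rows = conn.execute(share_sql).fetchall()
conn.close()

for row in share_rows:
    print(row)
```

In the toy data the recent year has more high-rated titles in absolute terms but a much lower share, which is the pattern the volume effect would predict.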


### 5. Low-Rated Movies 

In [27]:
def run_query(sql):
    conn = sqlite3.connect('/Users/mimi/Documents/imdb-analysis/data/netflix_imdb.db')
    result = pd.read_sql_query(sql, conn)
    conn.close()
    return result

sql = """
SELECT
    release_year,
    COUNT(*) FILTER (WHERE imdb_rating <= 4) AS low_rating_movies,
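    -- Note: this rank uses a threshold of 3, while low_rating_movies uses 4,
    -- so the rank column does not follow the order of the count column.
    RANK() OVER (ORDER BY COUNT(*) FILTER (WHERE imdb_rating <= 3) DESC) AS rank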
FROM
    netflix_imdb_data_now
GROUP BY
    release_year
ORDER BY
    low_rating_movies DESC
LIMIT 10;



"""

result = run_query(sql)
print(result)

   release_year  low_rating_movies  rank
0        2018.0                 45     3
1        2020.0                 37     4
2        2019.0                 35     2
3        2017.0                 35     4
4        2016.0                 29     1
5        2014.0                 14    10
6        2021.0                 12     6
7        2013.0                  8     7
8        2015.0                  8    20
9        2012.0                  7    11


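By the low_rating_movies count (IMDb rating ≤ 4), 2018 has the most low-rated movies (45), followed by 2020 (37) and then 2019 and 2017 (35 each). The rank column does not follow this order because it is computed with a stricter threshold (IMDb rating ≤ 3); by that measure 2016 ranks first, followed by 2019 and 2018. 2018 had both a high number of top-rated movies and low-rated ones.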
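
To keep the count and the rank on the same threshold, the cutoff can be defined once and the rank computed from the already-aggregated count. This is a sketch on a hypothetical in-memory table `toy_titles` with made-up rows; `LOW_RATING_MAX` and `year_rank` are illustrative names, and `COUNT(CASE ...)` is used as an equivalent of the `FILTER` clause that also runs on older SQLite builds.

```python
import sqlite3

LOW_RATING_MAX = 4  # single definition of "low-rated"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_titles (release_year INTEGER, imdb_rating REAL)")
# Made-up rows for illustration only.
toy_rows = [(2016, r) for r in (3.0, 2.5, 3.9, 7.0)]
toy_rows += [(2018, r) for r in (4.0, 3.5, 3.8, 2.0, 6.0)]
toy_rows += [(2019, r) for r in (3.9, 3.2, 3.1, 8.0)]
conn.executemany("INSERT INTO toy_titles VALUES (?, ?)", toy_rows)

low_sql = """
WITH yearly AS (
    SELECT
        release_year,
        COUNT(CASE WHEN imdb_rating <= :max_rating THEN 1 END) AS low_rating_movies
    FROM toy_titles
    WHERE release_year IS NOT NULL
    GROUP BY release_year
)
SELECT
    release_year,
    low_rating_movies,
    RANK() OVER (ORDER BY low_rating_movies DESC) AS year_rank
FROM yearly
ORDER BY low_rating_movies DESC, release_year
LIMIT 10;
"""
low_rows = conn.execute(low_sql, {"max_rating": LOW_RATING_MAX}).fetchall()
conn.close()

for row in low_rows:
    print(row)
```

Years with the same count share a rank, so ties stay visible in the output.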

### 6. The Most Prolific Directors

In [28]:
def run_query(sql):
    conn = sqlite3.connect('/Users/mimi/Documents/imdb-analysis/data/netflix_imdb.db')
    result = pd.read_sql_query(sql, conn)
    conn.close()
    return result

sql = """

SELECT 
    director, 
    COUNT(*) AS works
FROM 
    netflix_imdb_data_now
WHERE 
    director IS NOT NULL
GROUP BY 
    director
ORDER BY 
    works DESC
LIMIT 10;


"""

result = run_query(sql)
print(result)

                 director  works
0            Marcus Raboy     15
1  Raúl Campos, Jan Suter     14
2         Youssef Chahine     13
3              Ron Howard     13
4        Steven Spielberg     12
5           Mike Flanagan     12
6                     McG     12
7               Jay Karas     12
8     Cathy Garcia-Molina     12
9         Martin Scorsese     11


Marcus Raboy tops the list with 15 works, making him the most prolific director in the dataset. Raúl Campos and Jan Suter, a directing duo, follow closely with 14 works; because co-directed titles are stored as one combined director value, the pair is counted as a single entry.
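
If per-person counts are of interest, the combined director values can be split before counting. The sketch below assumes co-directors are separated by commas, as in "Raúl Campos, Jan Suter"; a name that itself contains a comma would be split incorrectly. The table `toy_titles` and its rows are hypothetical.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_titles (title TEXT, director TEXT)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO toy_titles VALUES (?, ?)",
    [
        ("T1", "Raúl Campos, Jan Suter"),
        ("T2", "Raúl Campos, Jan Suter"),
        ("T3", "Jan Suter"),
        ("T4", "Marcus Raboy"),
        ("T5", "Marcus Raboy"),
        ("T6", None),
    ],
)
director_fields = conn.execute(
    "SELECT director FROM toy_titles WHERE director IS NOT NULL"
).fetchall()
conn.close()

works_per_person = Counter()
for (director_field,) in director_fields:
    # Split a combined value into individual names and count each one.
    for name in director_field.split(","):
        name = name.strip()
        if name:
            works_per_person[name] += 1

for name, works in works_per_person.most_common():
    print(name, works)
```

Counting this way credits each co-director with the shared title, so the totals can exceed the number of titles.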