## Basic Python - Project <a id='intro'></a>

## Introduction <a id='intro'></a>
In this project, you will work with data from the entertainment industry. You will study a dataset with records on movies and shows. The research will focus on the "Golden Age" of television, which began in 1999 with the release of *The Sopranos* and is still ongoing.

The aim of this project is to investigate how the number of votes a title receives impacts its ratings. The assumption is that highly-rated shows (we will focus on TV shows, ignoring movies) released during the "Golden Age" of television also have the most votes.

### Stages 
Data on movies and shows is stored in the `/datasets/movies_and_shows.csv` file. There is no information about the quality of the data, so you will need to explore it before doing the analysis.

First, you'll evaluate the quality of the data and see whether its issues are significant. Then, during data preprocessing, you will try to account for the most critical problems.
 
Your project will consist of three stages:
 1. Data overview
 2. Data preprocessing
 3. Data analysis

## Stage 1. Data overview <a id='data_review'></a>

Open and explore the data.

You'll need `pandas`, so import it.

In [1]:
# importing pandas
import pandas as pd

Read the `movies_and_shows.csv` file from the `datasets` folder and save it in the `df` variable:

In [2]:
try:
    # Assuming the variable 'server_path' holds alternative paths
    orders = pd.read_csv('/datasets/movies_and_shows.csv')
except FileNotFoundError:
    # Provide the correct path or operation in the exception block
    orders = pd.read_csv(server_path['orders'])
    print(orders)

Print the first 10 table rows:

In [3]:
# obtaining the first 10 rows from the df table
# hint: you can use head() and tail() in Jupyter Notebook without wrapping them into print()
import pandas as pd

df = pd.read_csv('/datasets/movies_and_shows.csv')

print(df.head(10))

              name                      Character   r0le        TITLE   Type  \
0   Robert De Niro                  Travis Bickle  ACTOR  Taxi Driver  MOVIE   
1     Jodie Foster                  Iris Steensma  ACTOR  Taxi Driver  MOVIE   
2    Albert Brooks                            Tom  ACTOR  Taxi Driver  MOVIE   
3    Harvey Keitel        Matthew 'Sport' Higgins  ACTOR  Taxi Driver  MOVIE   
4  Cybill Shepherd                          Betsy  ACTOR  Taxi Driver  MOVIE   
5      Peter Boyle                         Wizard  ACTOR  Taxi Driver  MOVIE   
6   Leonard Harris      Senator Charles Palantine  ACTOR  Taxi Driver  MOVIE   
7   Diahnne Abbott                Concession Girl  ACTOR  Taxi Driver  MOVIE   
8      Gino Ardito             Policeman at Rally  ACTOR  Taxi Driver  MOVIE   
9  Martin Scorsese  Passenger Watching Silhouette  ACTOR  Taxi Driver  MOVIE   

   release Year              genres  imdb sc0re  imdb v0tes  
0          1976  ['drama', 'crime']         8.2    808582

Obtain the general information about the table with one command:

In [4]:
# obtaining general information about the data in df

# Load the dataset into a DataFrame
df = pd.read_csv('/datasets/movies_and_shows.csv')

print(df.head(10))
df = df.rename(columns={'Character': 'character',
                        'r0le': 'role',
                        'TITLE': 'title',
                        '  Type': 'type',
                        'release Year': 'release_year',
                        'imdb sc0re': 'imdb_score', 'imdb v0tes': 'imdb_votes'})
df.info()
print(df)

              name                      Character   r0le        TITLE   Type  \
0   Robert De Niro                  Travis Bickle  ACTOR  Taxi Driver  MOVIE   
1     Jodie Foster                  Iris Steensma  ACTOR  Taxi Driver  MOVIE   
2    Albert Brooks                            Tom  ACTOR  Taxi Driver  MOVIE   
3    Harvey Keitel        Matthew 'Sport' Higgins  ACTOR  Taxi Driver  MOVIE   
4  Cybill Shepherd                          Betsy  ACTOR  Taxi Driver  MOVIE   
5      Peter Boyle                         Wizard  ACTOR  Taxi Driver  MOVIE   
6   Leonard Harris      Senator Charles Palantine  ACTOR  Taxi Driver  MOVIE   
7   Diahnne Abbott                Concession Girl  ACTOR  Taxi Driver  MOVIE   
8      Gino Ardito             Policeman at Rally  ACTOR  Taxi Driver  MOVIE   
9  Martin Scorsese  Passenger Watching Silhouette  ACTOR  Taxi Driver  MOVIE   

   release Year              genres  imdb sc0re  imdb v0tes  
0          1976  ['drama', 'crime']         8.2    808582

The table contains nine columns. The majority store the same data type: object. The only exceptions are `'release Year'` (int64 type), `'imdb sc0re'` (float64 type) and `'imdb v0tes'` (float64 type). Scores and votes will be used in our analysis, so it's important to verify that they are present in the dataframe in the appropriate numeric format. Three columns (`'TITLE'`, `'imdb sc0re'` and `'imdb v0tes'`) have missing values.

According to the documentation:
- `'name'` — actor/director's name and last name
- `'Character'` — character played (for actors)
- `'r0le '` — the person's contribution to the title (it can be in the capacity of either actor or director)
- `'TITLE '` — title of the movie (show)
- `'  Type'` — show or movie
- `'release Year'` — year when movie (show) was released
- `'genres'` — list of genres under which the movie (show) falls
- `'imdb sc0re'` — score on IMDb
- `'imdb v0tes'` — votes on IMDb

We can see three issues with the column names:
1. Some names are uppercase, while others are lowercase.
2. There are names containing whitespace.
3. A few column names have digit '0' instead of letter 'o'. 


### Conclusions <a id='data_review_conclusions'></a> 

Each row in the table stores data about a movie or show. The columns can be divided into two categories: the first is about the roles held by different people who worked on the movie or show (role, name of the actor or director, and character if the row is about an actor); the second category is information about the movie or show itself (title, release year, genre, imdb figures).

It's clear that there is sufficient data to do the analysis and evaluate our assumption. However, to move forward, we need to preprocess the data.

## Stage 2. Data preprocessing <a id='data_preprocessing'></a>
Correct the formatting in the column headers and deal with the missing values. Then, check whether there are duplicates in the data.

In [5]:
# the list of column names in the df table
df.info()
print(df.columns)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85579 entries, 0 to 85578
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0      name       85579 non-null  object 
 1   character     85579 non-null  object 
 2   role          85579 non-null  object 
 3   title         85578 non-null  object 
 4   type          85579 non-null  object 
 5   release_year  85579 non-null  int64  
 6   genres        85579 non-null  object 
 7   imdb_score    80970 non-null  float64
 8   imdb_votes    80853 non-null  float64
dtypes: float64(2), int64(1), object(6)
memory usage: 5.9+ MB
Index(['   name', 'character', 'role', 'title', 'type', 'release_year',
       'genres', 'imdb_score', 'imdb_votes'],
      dtype='object')


Change the column names according to the rules of good style:
* If the name has several words, use snake_case
* All characters must be lowercase
* Remove whitespace
* Replace zero with letter 'o'

In [6]:
# renaming columns
df.columns = df.columns.str.lower().str.replace(' ', '').str.replace('0', 'o')

Check the result. Print the names of the columns once more:

In [7]:
# checking result: the list of column names
print(df)

                     name                character      role        title  \
0          Robert De Niro            Travis Bickle     ACTOR  Taxi Driver   
1            Jodie Foster            Iris Steensma     ACTOR  Taxi Driver   
2           Albert Brooks                      Tom     ACTOR  Taxi Driver   
3           Harvey Keitel  Matthew 'Sport' Higgins     ACTOR  Taxi Driver   
4         Cybill Shepherd                    Betsy     ACTOR  Taxi Driver   
...                   ...                      ...       ...          ...   
85574    Adelaida Buscato               Mar??a Paz     ACTOR      Lokillo   
85575  Luz Stella Luengas             Karen Bayona     ACTOR      Lokillo   
85576        In??s Prieto                    Fanny     ACTOR      Lokillo   
85577        Isabel Gaona                   Cacica     ACTOR      Lokillo   
85578      Julian Gaviria                  unknown  DIRECTOR      Lokillo   

            type  release_year              genres  imdb_score  imdb_votes 

### Missing values <a id='missing_values'></a>
First, find the number of missing values in the table. To do so, combine two `pandas` methods:

In [8]:
# calculating missing values
print(df.head())
print()
print(df.tail())

              name                character   role        title   type  \
0   Robert De Niro            Travis Bickle  ACTOR  Taxi Driver  MOVIE   
1     Jodie Foster            Iris Steensma  ACTOR  Taxi Driver  MOVIE   
2    Albert Brooks                      Tom  ACTOR  Taxi Driver  MOVIE   
3    Harvey Keitel  Matthew 'Sport' Higgins  ACTOR  Taxi Driver  MOVIE   
4  Cybill Shepherd                    Betsy  ACTOR  Taxi Driver  MOVIE   

   release_year              genres  imdb_score  imdb_votes  
0          1976  ['drama', 'crime']         8.2    808582.0  
1          1976  ['drama', 'crime']         8.2    808582.0  
2          1976  ['drama', 'crime']         8.2    808582.0  
3          1976  ['drama', 'crime']         8.2    808582.0  
4          1976  ['drama', 'crime']         8.2    808582.0  

                     name     character      role    title       type  \
85574    Adelaida Buscato    Mar??a Paz     ACTOR  Lokillo  the movie   
85575  Luz Stella Luengas  Karen Bay

Not all missing values affect the research: the single missing value in `'title'` is not critical. The missing values in columns `'imdb_score'` and `'imdb_votes'` represent around 6% of all records (4,609 and 4,726, respectively, of the total 85,579). This could potentially affect our research. To avoid this issue, we will drop rows with missing values in the `'imdb_score'` and `'imdb_votes'` columns.

In [9]:
# dropping rows where columns with title, scores and votes have missing values
print(df.dropna())

                     name                character      role        title  \
0          Robert De Niro            Travis Bickle     ACTOR  Taxi Driver   
1            Jodie Foster            Iris Steensma     ACTOR  Taxi Driver   
2           Albert Brooks                      Tom     ACTOR  Taxi Driver   
3           Harvey Keitel  Matthew 'Sport' Higgins     ACTOR  Taxi Driver   
4         Cybill Shepherd                    Betsy     ACTOR  Taxi Driver   
...                   ...                      ...       ...          ...   
85574    Adelaida Buscato               Mar??a Paz     ACTOR      Lokillo   
85575  Luz Stella Luengas             Karen Bayona     ACTOR      Lokillo   
85576        In??s Prieto                    Fanny     ACTOR      Lokillo   
85577        Isabel Gaona                   Cacica     ACTOR      Lokillo   
85578      Julian Gaviria                  unknown  DIRECTOR      Lokillo   

            type  release_year              genres  imdb_score  imdb_votes 

Make sure the table doesn't contain any more missing values. Count the missing values again.

In [10]:
# counting missing values
print(df.isnull())

        name  character   role  title   type  release_year  genres  \
0      False      False  False  False  False         False   False   
1      False      False  False  False  False         False   False   
2      False      False  False  False  False         False   False   
3      False      False  False  False  False         False   False   
4      False      False  False  False  False         False   False   
...      ...        ...    ...    ...    ...           ...     ...   
85574  False      False  False  False  False         False   False   
85575  False      False  False  False  False         False   False   
85576  False      False  False  False  False         False   False   
85577  False      False  False  False  False         False   False   
85578  False      False  False  False  False         False   False   

       imdb_score  imdb_votes  
0           False       False  
1           False       False  
2           False       False  
3           False       False  

In [11]:
df.isna().sum()

name               0
character          0
role               0
title              1
type               0
release_year       0
genres             0
imdb_score      4609
imdb_votes      4726
dtype: int64

### Duplicates <a id='duplicates'></a>
Find the number of duplicate rows in the table using one command:

In [12]:
# counting duplicate rows
print(df.duplicated())

0        False
1        False
2        False
3        False
4        False
         ...  
85574    False
85575    False
85576    False
85577     True
85578    False
Length: 85579, dtype: bool


Review the duplicate rows to determine if removing them would distort our dataset.

In [13]:
# Produce table with duplicates (with original rows included) and review last 5 rows
df = df.drop_duplicates().reset_index(drop=True)
df = df.dropna()
print(df)

                     name                character      role        title  \
0          Robert De Niro            Travis Bickle     ACTOR  Taxi Driver   
1            Jodie Foster            Iris Steensma     ACTOR  Taxi Driver   
2           Albert Brooks                      Tom     ACTOR  Taxi Driver   
3           Harvey Keitel  Matthew 'Sport' Higgins     ACTOR  Taxi Driver   
4         Cybill Shepherd                    Betsy     ACTOR  Taxi Driver   
...                   ...                      ...       ...          ...   
77899       A??da Morales                  Maritza     ACTOR      Lokillo   
77900    Adelaida Buscato               Mar??a Paz     ACTOR      Lokillo   
77901  Luz Stella Luengas             Karen Bayona     ACTOR      Lokillo   
77902        In??s Prieto                    Fanny     ACTOR      Lokillo   
77903      Julian Gaviria                  unknown  DIRECTOR      Lokillo   

            type  release_year              genres  imdb_score  imdb_votes 

There are two clear duplicates in the printed rows. We can safely remove them.
Call the `pandas` method for getting rid of duplicate rows:

In [14]:
# removing duplicate rows
print(df.drop_duplicates())

                     name                character      role        title  \
0          Robert De Niro            Travis Bickle     ACTOR  Taxi Driver   
1            Jodie Foster            Iris Steensma     ACTOR  Taxi Driver   
2           Albert Brooks                      Tom     ACTOR  Taxi Driver   
3           Harvey Keitel  Matthew 'Sport' Higgins     ACTOR  Taxi Driver   
4         Cybill Shepherd                    Betsy     ACTOR  Taxi Driver   
...                   ...                      ...       ...          ...   
77899       A??da Morales                  Maritza     ACTOR      Lokillo   
77900    Adelaida Buscato               Mar??a Paz     ACTOR      Lokillo   
77901  Luz Stella Luengas             Karen Bayona     ACTOR      Lokillo   
77902        In??s Prieto                    Fanny     ACTOR      Lokillo   
77903      Julian Gaviria                  unknown  DIRECTOR      Lokillo   

            type  release_year              genres  imdb_score  imdb_votes 

Check for duplicate rows once more to make sure you have removed all of them:

In [15]:
# checking for duplicates
print(df)

                     name                character      role        title  \
0          Robert De Niro            Travis Bickle     ACTOR  Taxi Driver   
1            Jodie Foster            Iris Steensma     ACTOR  Taxi Driver   
2           Albert Brooks                      Tom     ACTOR  Taxi Driver   
3           Harvey Keitel  Matthew 'Sport' Higgins     ACTOR  Taxi Driver   
4         Cybill Shepherd                    Betsy     ACTOR  Taxi Driver   
...                   ...                      ...       ...          ...   
77899       A??da Morales                  Maritza     ACTOR      Lokillo   
77900    Adelaida Buscato               Mar??a Paz     ACTOR      Lokillo   
77901  Luz Stella Luengas             Karen Bayona     ACTOR      Lokillo   
77902        In??s Prieto                    Fanny     ACTOR      Lokillo   
77903      Julian Gaviria                  unknown  DIRECTOR      Lokillo   

            type  release_year              genres  imdb_score  imdb_votes 

Now get rid of implicit duplicates in the `'type'` column. For example, the string `'SHOW'` can be written in different ways. These kinds of errors will also affect the result.

Print a list of unique `'type'` names, sorted in alphabetical order. To do so:
* Retrieve the intended dataframe column 
* Apply a sorting method to it
* For the sorted column, call the method that will return all unique column values

In [16]:
# viewing unique type names
print(df.nunique())

name            52270
character       45492
role                2
title            5008
type                9
release_year       62
genres           1645
imdb_score         78
imdb_votes       3811
dtype: int64


Look through the list to find implicit duplicates of `'show'` (`'movie'` duplicates will be ignored since the assumption is about shows). These could be names written incorrectly or alternative names of the same genre.

You will see the following implicit duplicates:
* `'shows'`
* `'SHOW'`
* `'tv show'`
* `'tv shows'`
* `'tv series'`
* `'tv'`

To get rid of them, declare the function `replace_wrong_show()` with two parameters: 
* `wrong_shows_list=` — the list of duplicates
* `correct_show=` — the string with the correct value

The function should correct the names in the `'type'` column from the `df` table (i.e., replace each value from the `wrong_shows_list` list with the value in `correct_show`).

Call `replace_wrong_show()` and pass it arguments so that it clears implicit duplicates and replaces them with `SHOW`:

In [17]:
# function for replacing implicit duplicates
def replace_wrong_show(df, wrong_shows_list, correct_show):
    for wrong_show in wrong_shows_list:
        df['type'] = df['type'].replace(wrong_show, correct_show)
    return df

In [18]:
# removing implicit duplicates
wrong_shows_list = ["tv show", "the movie", "shows", "tv shows", "tv series", "tv", "movies", "MOVIE"]  # Example incorrect values
correct_show = "SHOW"
df = replace_wrong_show(df, wrong_shows_list, correct_show)

In [19]:
df.type.unique()

array(['SHOW'], dtype=object)

Make sure the duplicate names are removed. Print the list of unique values from the `'type'` column:

In [20]:
# viewing unique genre names
df.type.unique()
wrong_shows_list = ['SHOW', 'shows', 'tv', 'tv series', 'tv show', 'tv shows']
correct_show = 'SHOW'
df = replace_wrong_show(df, wrong_shows_list, correct_show)
print(df)

                     name                character      role        title  \
0          Robert De Niro            Travis Bickle     ACTOR  Taxi Driver   
1            Jodie Foster            Iris Steensma     ACTOR  Taxi Driver   
2           Albert Brooks                      Tom     ACTOR  Taxi Driver   
3           Harvey Keitel  Matthew 'Sport' Higgins     ACTOR  Taxi Driver   
4         Cybill Shepherd                    Betsy     ACTOR  Taxi Driver   
...                   ...                      ...       ...          ...   
77899       A??da Morales                  Maritza     ACTOR      Lokillo   
77900    Adelaida Buscato               Mar??a Paz     ACTOR      Lokillo   
77901  Luz Stella Luengas             Karen Bayona     ACTOR      Lokillo   
77902        In??s Prieto                    Fanny     ACTOR      Lokillo   
77903      Julian Gaviria                  unknown  DIRECTOR      Lokillo   

       type  release_year              genres  imdb_score  imdb_votes  
0  

In [21]:
print(df)

                     name                character      role        title  \
0          Robert De Niro            Travis Bickle     ACTOR  Taxi Driver   
1            Jodie Foster            Iris Steensma     ACTOR  Taxi Driver   
2           Albert Brooks                      Tom     ACTOR  Taxi Driver   
3           Harvey Keitel  Matthew 'Sport' Higgins     ACTOR  Taxi Driver   
4         Cybill Shepherd                    Betsy     ACTOR  Taxi Driver   
...                   ...                      ...       ...          ...   
77899       A??da Morales                  Maritza     ACTOR      Lokillo   
77900    Adelaida Buscato               Mar??a Paz     ACTOR      Lokillo   
77901  Luz Stella Luengas             Karen Bayona     ACTOR      Lokillo   
77902        In??s Prieto                    Fanny     ACTOR      Lokillo   
77903      Julian Gaviria                  unknown  DIRECTOR      Lokillo   

       type  release_year              genres  imdb_score  imdb_votes  
0  

### Conclusions <a id='data_preprocessing_conclusions'></a>
We detected three issues with the data:

- Incorrect header styles
- Missing values
- Duplicate rows and implicit duplicates

The headers have been cleaned up to make processing the table simpler.

All rows with missing values have been removed. 

The absence of duplicates will make the results more precise and easier to understand.

Now we can move on to our analysis of the prepared data.

## Stage 3. Data analysis <a id='hypotheses'></a>

Based on the previous project stages, you can now define how the assumption will be checked. Calculate the average amount of votes for each score (this data is available in the `imdb_score` and `imdb_votes` columns), and then check how these averages relate to each other. If the averages for shows with the highest scores are bigger than those for shows with lower scores, the assumption appears to be true.

Based on this, complete the following steps:

- Filter the dataframe to only include shows released in 1999 or later.
- Group scores into buckets by rounding the values of the appropriate column (a set of 1-10 integers will help us make the outcome of our calculations more evident without damaging the quality of our research).
- Identify outliers among scores based on their number of votes, and exclude scores with few votes.
- Calculate the average votes for each score and check whether the assumption matches the results.

To filter the dataframe and only include shows released in 1999 or later, you will take two steps. First, keep only titles published in 1999 or later in our dataframe. Then, filter the table to only contain shows (movies will be removed).

In [22]:
# using conditional indexing modify df so it has only titles released after 1999 (with 1999 included)
# give the slice of dataframe new name
def shows_after_99(df):
    for shows in df:
        if shows >= 1999:
            print(shows)
        
        
    

In [23]:
# repeat conditional indexing so df has only shows (movies are removed as result)
def movies_after_99(df):
    for movies in df:
        if movies >= 1999:
            print(movies)
# Filter shows released after 1999
filtered_df = df[(df['release_year'] > 1999) & (df['type'] == 'SHOW')]

# Rounding the scores
filtered_df['rounded_score'] = filtered_df['imdb_score'].round()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df['rounded_score'] = filtered_df['imdb_score'].round()


The scores that are to be grouped should be rounded. For instance, titles with scores like 7.8, 8.1, and 8.3 will all be placed in the same bucket with a score of 8.

In [24]:
# rounding column with scores
print(df.round())
#checking the outcome with tail()
print(df.head())
print()
print(df.tail())


                     name                character      role        title  \
0          Robert De Niro            Travis Bickle     ACTOR  Taxi Driver   
1            Jodie Foster            Iris Steensma     ACTOR  Taxi Driver   
2           Albert Brooks                      Tom     ACTOR  Taxi Driver   
3           Harvey Keitel  Matthew 'Sport' Higgins     ACTOR  Taxi Driver   
4         Cybill Shepherd                    Betsy     ACTOR  Taxi Driver   
...                   ...                      ...       ...          ...   
77899       A??da Morales                  Maritza     ACTOR      Lokillo   
77900    Adelaida Buscato               Mar??a Paz     ACTOR      Lokillo   
77901  Luz Stella Luengas             Karen Bayona     ACTOR      Lokillo   
77902        In??s Prieto                    Fanny     ACTOR      Lokillo   
77903      Julian Gaviria                  unknown  DIRECTOR      Lokillo   

       type  release_year              genres  imdb_score  imdb_votes  
0  

It is now time to identify outliers based on the number of votes.

In [25]:
# Use groupby() for scores and count all unique values in each group, print the result
print(df.groupby('imdb_score'))
print(df.imdb_score)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f3809a63370>
0        8.2
1        8.2
2        8.2
3        8.2
4        8.2
        ... 
77899    3.8
77900    3.8
77901    3.8
77902    3.8
77903    3.8
Name: imdb_score, Length: 73859, dtype: float64


Based on the aggregation performed, it is evident that scores 2 (24 voted shows), 3 (27 voted shows), and 10 (only 8 voted shows) are outliers. There isn't enough data for these scores for the average number of votes to be meaningful.

To obtain the mean numbers of votes for the selected scores (we identified a range of 4-9 as acceptable), use conditional filtering and grouping.

In [26]:
# filter dataframe using two conditions (scores to be in the range 4-9)
mean = df.groupby('imdb_score').mean()
print(mean)

# group scores and corresponding average number of votes, reset index and print the result
filtered_df = filtered_df[(filtered_df['imdb_score']>= 4) & (filtered_df['imdb_score']<= 9)]
filtered_df

grouped_scores_votes = filtered_df.groupby('rounded_score')['imdb_votes'].mean().reset_index()
print(grouped_scores_votes)

            release_year    imdb_votes
imdb_score                            
1.5          2016.000000  3.880000e+02
1.6          2020.000000  3.700000e+02
1.7          2013.000000  8.648000e+03
1.9          2016.000000  4.356000e+03
2.0          2015.592593  2.506704e+03
...                  ...           ...
9.0          2010.315789  1.537102e+05
9.1          2020.882353  4.935488e+04
9.2          2015.000000  6.718000e+03
9.3          2006.166667  2.818821e+05
9.5          2008.000000  1.775990e+06

[78 rows x 2 columns]
   rounded_score     imdb_votes
0            4.0    9168.943609
1            5.0   14241.794418
2            6.0   28665.995804
3            7.0   50708.480627
4            8.0  145883.125487
5            9.0  265172.031627


Now for the final step! Round the column with the averages, rename both columns, and print the dataframe in descending order.

In [27]:
# round column with averages
round = df['imdb_score'].round()

# rename columns
df.columns = df.columns.str.replace('releaseyear', 'release_year')
df.columns = df.columns.str.replace('imdbvotes', 'imdb_votes')

# print dataframe in descending order
print(df)

                     name                character      role        title  \
0          Robert De Niro            Travis Bickle     ACTOR  Taxi Driver   
1            Jodie Foster            Iris Steensma     ACTOR  Taxi Driver   
2           Albert Brooks                      Tom     ACTOR  Taxi Driver   
3           Harvey Keitel  Matthew 'Sport' Higgins     ACTOR  Taxi Driver   
4         Cybill Shepherd                    Betsy     ACTOR  Taxi Driver   
...                   ...                      ...       ...          ...   
77899       A??da Morales                  Maritza     ACTOR      Lokillo   
77900    Adelaida Buscato               Mar??a Paz     ACTOR      Lokillo   
77901  Luz Stella Luengas             Karen Bayona     ACTOR      Lokillo   
77902        In??s Prieto                    Fanny     ACTOR      Lokillo   
77903      Julian Gaviria                  unknown  DIRECTOR      Lokillo   

       type  release_year              genres  imdb_score  imdb_votes  
0  

The assumption macthes the analysis: the shows with the top 3 scores have the most amounts of votes.

## Conclusion <a id='hypotheses'></a>

The research done confirms that highly-rated shows released during the "Golden Age" of television also have the most votes. While shows with score 4 have more votes than ones with scores 5 and 6, the top three (scores 7-9) have the largest number. The data studied represents around 94% of the original set, so we can be confident in our findings.