# Data Analysis & Visualization with Python - Movie Ratings

In [None]:
# load libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# left-align tables in markdown
from IPython.display import HTML
table_css = "table {align:left;display:block}"
HTML("<style>{}</style>".format(table_css))

In [None]:
# import data
filePath = os.path.join(".", "data", "imdb.csv")
df = pd.read_csv(filePath)

# preview data
df.head()

In [None]:
# describe data
df.describe()

In [None]:
# size and shape of data
print("Size: ", df.size, "\tShape: ", df.shape, "\n\n")

# check for null values
# tells you if a null value exists in each column
print("Null Values\n", df.isnull().any(), "\n\n")

# tells you how many null values exist in each column
print("Null/NaN Count\n", df.isnull().sum(), "\n\n")

# check data types
df.dtypes

### How would you describe this data: clean or messy or somewhere in the middle?
### What problems might you run into based on what you know about the data so far?

In [None]:
# pie chart using pandas' plotting interface (built on matplotlib)
p_pie = (
    df.groupby("Rate")["Rate"]
    .count()
    .plot.pie()
)
p_pie.set_title("Movie Ratings")

plt.show()

In [None]:
# pie chart with percentage labels in a legend
y = df.groupby("Rate")["Rate"].count()
x = y.index  # take the labels from the grouped result so they line up with the counts
pct = 100.0 * y / y.sum()

p_pie = y.plot.pie(labels=None)
p_pie.legend(
    labels=["{0} : {1:1.2f} %".format(i, j) for i, j in zip(x, pct)],
    loc="center right",
    bbox_to_anchor=(1, 0, 0.5, 1),
)
p_pie.set_title("IMDB Movie Ratings")

plt.tight_layout()

plt.show()

## Let's make a better visual using the same data.

We'll start by grouping the ratings into letter grades, just like in school:

| Letter Grade | Score |
|---|---|
| A+ | 97–100% |
| A | 93–96% |
| A− | 90–92% |
| B+ | 87–89% |
| B | 83–86% |
| B− | 80–82% |
| C+ | 77–79% |
| C | 73–76% |
| C− | 70–72% |
| D+ | 67–69% |
| D | 63–66% |
| D− | 60–62% |
| F | 0–59% |

In [None]:
# We'll need to add a new "Grade" column to the dataframe.
# Let's use numpy to create a nested conditional statement and assign a Letter Grade to each movie in the dataframe:
df["Grade"] = np.where(
    df["Rate"] < 6,
    "F",
    np.where(
        df["Rate"] < 6.3,
        "D-",
        np.where(
            df["Rate"] < 6.7,
            "D",
            np.where(
                df["Rate"] < 7,
                "D+",
                np.where(
                    df["Rate"] < 7.3,
                    "C-",
                    np.where(
                        df["Rate"] < 7.7,
                        "C",
                        np.where(
                            df["Rate"] < 8,
                            "C+",
                            np.where(
                                df["Rate"] < 8.3,
                                "B-",
                                np.where(
                                    df["Rate"] < 8.7,
                                    "B",
                                    np.where(
                                        df["Rate"] < 9.0,
                                        "B+",
                                        np.where(
                                            df["Rate"] < 9.3,
                                            "A-",
                                            np.where(df["Rate"] < 9.7, "A", "A+"),
                                        ),
                                    ),
                                ),
                            ),
                        ),
                    ),
                ),
            ),
        ),
    ),
)

# Let's check to make sure our new "Grade" column is in the dataframe.
# This time, we'll preview the first 5 rows and only show the Title, Rate, and Grade columns:
df.loc[:, ["Title", "Rate", "Grade"]].head()

In [None]:
# Great, our new "Grade" column is now part of the dataframe.
# Let's see how our pie chart looks when we use the "Grade" column instead of the "Rate" column

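grade_order = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F"]

# count movies per grade, ordered from best to worst, and drop grades with no movies
# (groupby alone would sort the grades alphabetically, so the labels would not match)
y = df["Grade"].value_counts().reindex(grade_order, fill_value=0)
y = y[y > 0]
pct = 100.0 * y / y.sum()

p_pie = y.plot.pie(labels=None)
p_pie.legend(
    labels=["{0} : {1:1.2f} %".format(i, j) for i, j in zip(y.index, pct)],
    loc="center right",
    bbox_to_anchor=(1, 0, 0.5, 1),
)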
p_pie.set_title("IMDB Movie Grades")

plt.tight_layout()

plt.show()

## Wow, that's much better! (Or is it?)

The technical term for what we've just done here is "binning," like "putting things into bins."
Binning is useful for creating groups of rows with similar (but not identical) values.  
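
As a rough sketch of the same idea, pandas can do the binning in one call with `pd.cut`. The ratings below are made-up values (not taken from imdb.csv), and the bin edges are an assumption chosen to mirror the cut-offs in the nested `np.where` cell above. The upper edge of 10.01 is only there so that a rating of exactly 10 would still land in the last bin.

```python
import pandas as pd

# Hypothetical ratings on a 0-10 scale, for illustration only
rates = pd.Series([5.9, 6.0, 6.5, 7.2, 8.0, 8.6, 9.3, 9.8], name="Rate")

# 14 edges define 13 bins; right=False makes each bin [low, high)
edges = [0, 6.0, 6.3, 6.7, 7.0, 7.3, 7.7, 8.0, 8.3, 8.7, 9.0, 9.3, 9.7, 10.01]
labels = ["F", "D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]

grades = pd.cut(rates, bins=edges, labels=labels, right=False)
print(grades.tolist())
# → ['F', 'D-', 'D', 'C-', 'B-', 'B', 'A', 'A+']
```

The result is an ordered categorical, so the grades keep their F-to-A+ order when you count or plot them instead of being sorted alphabetically.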

## Questions

- Can you think of any examples where you've seen binning used in a chart or graph in the real world?
- If so, do you think it helped make the data easier to understand?  

## To do:

How would you visualize this data? Would you adjust the bins? Would you use the original data with a completely different kind of chart? Or would you combine both approaches?
  



In [None]:
# show your solution to visualizing the Rate column


# Handling missing data

##### Here, we will create a copy of the dataframe for each method (drop, mean, median, and mode) so we can more easily explore what happens in each scenario.
##### Using .copy() lets us make changes without altering the original df. Plain assignment (e.g. df_new = df) would not do this: both names would refer to the same dataframe.
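
A minimal sketch of the difference, using a tiny made-up dataframe (`df_demo`, `alias`, and `independent` are hypothetical names used only for this illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({"Rate": [7.1, 8.4]})  # hypothetical toy data

alias = df_demo                # a second name for the SAME dataframe
independent = df_demo.copy()   # a separate dataframe with its own data

independent.loc[0, "Rate"] = 0.0   # only the copy changes
alias.loc[1, "Rate"] = 0.0         # the original changes too

print(df_demo)
print(independent)
```

After running this, `df_demo` still has 7.1 in its first row (the edit to `independent` did not touch it) but has 0.0 in its second row, because `alias` and `df_demo` are the same object.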

In [None]:
# create copies of the dataframe
df_drop = df.copy()
df_mean = df.copy()
df_median = df.copy()
df_mode = df.copy()

#### Filling with mode

#### We'll compare what happens to the Certificate column via countplot

In [None]:
# create countplot for Certificate
s_count = sns.countplot(
    data=df, x="Certificate", order=df["Certificate"].value_counts().index
)
s_count.set_title("Certificate Count")

# rotate x-axis tick marks for better visibility
plt.xticks(rotation=70)

plt.show()

##### Note that the chart above does not show any NaNs, yet we know that there are 27 NaN values and we haven't dropped anything. Seaborn leaves NaNs out of its countplots. The benefit is that, for this particular visual, we don't have to write code to remove them. The downside is that missing data can silently disappear from a chart even when we do want to see it.

##### We have a couple options:
* We can force Seaborn to count the NaN/null values by filling in the missing data; since this column is categorical, we can just use the word "Missing" or similar
* We can use Pandas or Matplotlib instead

In [None]:
# option 1: fill the NaN/null values with "Missing" so Seaborn will count them

# create a copy of the dataframe and fill its Certificate column
# (assign the result back rather than calling fillna(inplace=True) on a column)
df_missing = df.copy()
df_missing["Certificate"] = df_missing["Certificate"].fillna("Missing")

s_count = sns.countplot(
    data=df_missing,
    x="Certificate",
    order=df_missing["Certificate"].value_counts().index,
)
s_count.set_title("Certificate Count")

# rotate x-axis tick marks for better visibility
plt.xticks(rotation=70)

plt.show()

In [None]:
# option 2: count plot using Pandas, keeping NaN as its own bar
df.Certificate.value_counts(dropna=False).plot(kind="bar").set_title(
    "Certificate Count"
)

plt.show()

##### We'll use fillna() to fill with the mode and we'll see how this looks when we plot this next.
##### Because we will be making a modification to the dataframe that we don't necessarily want to keep, we'll use df_mode

In [None]:
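# fill missing Certificate values with the mode
# (assign the result back rather than calling fillna(inplace=True) on a column)
df_mode["Certificate"] = df_mode["Certificate"].fillna(df_mode["Certificate"].mode()[0])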

# tells you how many null values exist in each column
print("Original Null/NaN Count\n", df.isnull().sum(), "\n\n")
print("New Null/NaN Count\n", df_mode.isnull().sum())

In [None]:
# show plots from two different dataframes in single visual

# create copies (again, we don't necessarily want to alter what we've done to these so far - this won't always be the case)
df1 = df.copy()
df2 = df_mode.copy()

# add new column called 'Key' to each of the new dataframes we created
df1["Key"] = "Original"
df2["Key"] = "fillna mode"

# combine the two dataframes
df_new = pd.concat([df1, df2], keys=["Original", "fillna mode"])
dfgroup = df_new.groupby(["Certificate", "Key"])

# create plot using matplotlib
dfgroup_plot = dfgroup["Certificate"].count().unstack("Key").plot(kind="bar")
dfgroup_plot.set_title("Original vs fillna mode Certificate Count")

# rotate x-axis tick marks for better visibility
plt.xticks(rotation=70)

plt.show()

In [None]:
# same as above, but to show two separate plots in a single output

# set subplots
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5), sharex=False)

# adjust spacing between subplots (if vertical, would use h_pad)
f.tight_layout(w_pad=3)

sns.countplot(
    data=df_missing,
    x="Certificate",
    order=df_missing["Certificate"].value_counts().index,
    ax=ax1,
).set_title("Countplot: Original")
sns.countplot(
    data=df_mode,
    x="Certificate",
    order=df_mode["Certificate"].value_counts().index,
    ax=ax2,
).set_title("Countplot: Fillna Mode")

# rotate x-axis tick marks for better visibility
# plt.xticks(rotation=...) only acts on the current axes, so with two subplots we rotate the labels on each axes instead
# were you to show each figure in a different cell, you could use plt.xticks
ax1.tick_params(axis="x", labelrotation=90)
ax2.tick_params(axis="x", labelrotation=90)

plt.show()

##### We can see that between the original and the version with NaNs filled by the mode, every count stayed the same except for 'R'.
##### R started as the certificate with the highest count, so it is the mode, and every NaN we filled was added to the R count.
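
To make that effect concrete, here is a small sketch on made-up certificate values (not the real Certificate column). It fills the NaNs with the mode and then reports which category counts changed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: "R" is the most common value, two values are missing
cert = pd.Series(["R", "R", "PG", "PG-13", np.nan, np.nan], name="Certificate")

mode_value = cert.mode()[0]        # mode() ignores NaN by default
filled = cert.fillna(mode_value)

before = cert.value_counts()       # value_counts() also ignores NaN by default
after = filled.value_counts()

# difference per category; only the mode's count should move
change = after.sub(before, fill_value=0)
print(change[change != 0])
```

Only the mode category shows a non-zero change, and the size of that change equals the number of NaNs that were filled.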

#### Test out different methods on the Metascore column

##### Try countplots, histograms, or other visuals that fit the data well
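
Before working on the real column, it may help to see how the three fill values can differ. The sketch below uses a short made-up series (the numbers are not from the dataset, and the name `scores` is hypothetical); the same calls should apply to `df_mean`, `df_median`, and `df_mode`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores with one missing value
scores = pd.Series([10.0, 10.0, 30.0, 50.0, 100.0, np.nan], name="Metascore")

# mean(), median(), and mode() all skip NaN by default
fills = {
    "mean": scores.mean(),
    "median": scores.median(),
    "mode": scores.mode()[0],
}

for name, value in fills.items():
    filled = scores.fillna(value)
    # report the fill value and confirm that no NaNs remain
    print(name, value, filled.isnull().sum())
```

For this toy data the mean is 40, the median is 30, and the mode is 10, so the choice of method changes what the filled-in row looks like even though each one removes the NaN.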

In [None]:
# check status of dfs so you know what you're working with first
# fill missing values as necessary

# check status of dfs
# hint: use print statements to see them all in one output

In [None]:
# dropna (use df_drop)

# check that no NaN/nulls are remaining
# it's always good to check that your code has done what you think you told it to do

In [None]:
# fillna with mean

# check that no NaN/nulls are remaining
# it's always good to check that your code has done what you think you told it to do

In [None]:
# fillna with median

# check that no NaN/nulls are remaining
# it's always good to check that your code has done what you think you told it to do

In [None]:
# fillna with mode

# check that no NaN/nulls are remaining
# it's always good to check that your code has done what you think you told it to do

In [None]:
# check status of dfs again to make sure you didn't miss anything
# we won't worry about imputing missing categorical data with statistical methods

In [None]:
# for the purpose of this exercise, we'll drop missing categorical data, though this is not always what you want to do
# because so few categorical values are missing, we'll just drop them and recheck that none remain

# check status of dfs again to make sure you didn't miss anything

#### Method 1 or Visual Choice 1

Either follow the same method for all your visuals (e.g. create all visuals using dropna or fillna mean)
-OR-
use the same visual for multiple methods to see the effect different methods have on your visuals.

You can create single visuals or create multiple visuals in a single output - you decide what you want to look at and why!

Think about what you want to make and why - maybe you want to deliberately make a good visual and maybe you don't.
Regardless, after you create a visual ask yourself:
1. Does it look the way I expected it to? Why or why not?
2. Was this useful? Why or why not?
3. What did I learn from this visual and why is it important?

In [None]:
# visual 1

#### Method 2 or Visual Choice 2

In [None]:
# visual 2

#### Method 3 or Visual Choice 3

In [None]:
# visual 3

#### Questions to ask yourself
##### Did all your visuals turn out as useful as you thought they would be? Why or why not?
##### What did you learn?