In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

### Set default options for Pandas to display more columns

In [None]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

### Read the CSV that exists inside this directory

We can optionally add the `encoding='utf-8'` argument to make sure the file is read with the correct encoding.

The `df.head()` method displays the first five rows. Passing a number, as in `df.head(10)`, displays that many rows instead.

In [None]:
df = pd.read_csv('epicurious-recipes.csv', encoding='utf-8')
df.head()
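
As a small illustration of passing a row count to `head`, the sketch below uses a made-up frame rather than the recipes file. The `demo` name and all of its values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in data; the notebook itself works on df loaded from the CSV
demo = pd.DataFrame({
    "title": [f"recipe {i}" for i in range(12)],
    "rating": [4.375, 3.75, 5.0, 0.0, 4.375, 3.125,
               2.5, 4.375, 3.75, 5.0, 1.25, 4.375],
    "calories": [426, 403, 165, 547, 948, 256,
                 766, 174, 134, 382, 602, 215],
})

first_five = demo.head()    # default: the first 5 rows
first_ten = demo.head(10)   # pass n to see more rows
print(first_ten)
```

In a notebook, leaving `demo.head(10)` as the last line of a cell renders the same rows as a formatted table.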

In [None]:
len(df)

### Graphing comparisons of these columns

If we have multiple numerical columns, we can use seaborn to draw a **pairplot**: a grid of scatter plots covering every pairing of the chosen columns, with each column's own distribution on the diagonal. This is a quick way to see whether any relationships exist within our dataset.

In [None]:
# Seaborn Gallery: https://seaborn.pydata.org/examples/index.html

sns.pairplot(df, vars=['calories','protein','fat','sodium'])
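
A pairplot is visual, so it can help to put numbers next to it. One possible companion, sketched here on synthetic data, is a correlation matrix from `DataFrame.corr()`. The `demo` frame is an assumption made for illustration: calories are generated from fat and protein (roughly 9 and 4 kcal per gram) plus noise, and none of the values come from the recipes file.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical data standing in for the recipe columns
rng = np.random.default_rng(0)
fat = rng.uniform(0, 50, size=200)
protein = rng.uniform(0, 40, size=200)
demo = pd.DataFrame({
    "fat": fat,
    "protein": protein,
    # Calories built from fat and protein plus random noise
    "calories": 9 * fat + 4 * protein + rng.normal(0, 20, size=200),
    # Sodium is generated independently of everything else
    "sodium": rng.uniform(0, 2000, size=200),
})

# Pairwise Pearson correlations for the same columns the pairplot uses
corr = demo[["calories", "protein", "fat", "sodium"]].corr()
print(corr.round(2))
```

Values near 1 or -1 correspond to the tight diagonal bands in a pairplot, while values near 0 correspond to shapeless clouds.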

### Let's make a category plot

Rating is a difficult metric to analyze, because there are only a few values to consider. This dataset only has about 20 different rating values between 0 and 5, so it becomes more of a category than a granular numerical quantity. 

Let's try comparing it to some of the nutrients in our dataset and see if we notice any patterns.

In [None]:
sns.catplot(x="rating", y="fat", kind="box", data=df)
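
To check how many distinct values a column like rating really has, `nunique()` and `value_counts()` are a quick test. This is a minimal sketch on a hypothetical series; the `ratings` values below are invented, not read from the dataset.

```python
import pandas as pd

# Hypothetical ratings, including one missing value
ratings = pd.Series([4.375, 3.75, 5.0, 0.0, 4.375, 3.125, 4.375, 3.75, None])

# Number of distinct values (missing values are ignored by default)
n_distinct = ratings.nunique()

# How often each value occurs, ordered by the rating itself
counts = ratings.value_counts().sort_index()

print(n_distinct)
print(counts)
```

A small number of distinct values relative to the number of rows is a sign that the column behaves like a category, which is why a box plot per rating is a reasonable choice.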

### Looking at the data in different ways

We can look at the data in lots of different ways. One type of chart, a **kernel density estimate** (KDE for short), shows where our values are most concentrated. Even so, it's difficult to discern a clear pattern here.

In [None]:
sns.jointplot(x="rating", y="calories", data=df, kind="kde", dropna=True);

### Calculate R squared

To get a better understanding of the relationship between ratings and nutrition, let's fit a regression line and find the R².

In [None]:
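# Function to calculate R² (the squared Pearson correlation coefficient)
def r2(x, y):
    return stats.pearsonr(x, y)[0] ** 2

# Drop rows with missing values first, since pearsonr cannot handle NaN
clean = df[['rating', 'fat']].dropna()

j = sns.jointplot(x="rating", y="fat", data=clean, kind="reg");
# JointGrid.annotate was removed in seaborn 0.11, so label the joint axes directly
j.ax_joint.annotate(f"R² = {r2(clean['rating'], clean['fat']):.3f}",
                    xy=(0.05, 0.95), xycoords='axes fraction', va='top')
plt.show()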
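
The regression plot draws the fitted line but does not report its slope. One way to get the slope and R² together is `scipy.stats.linregress`, sketched below on synthetic data. The `x` and `y` arrays are assumptions generated for illustration with a true slope of 2, not columns from the recipes file.

```python
import numpy as np
from scipy import stats

# Synthetic, hypothetical data: y = 2x + 10 plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=300)
y = 2.0 * x + 10.0 + rng.normal(0, 1.0, size=300)

# Ordinary least-squares fit of y on x
result = stats.linregress(x, y)

# For a simple linear regression, R² is the squared correlation coefficient
r_squared = result.rvalue ** 2
pearson_r2 = stats.pearsonr(x, y)[0] ** 2

print(f"slope={result.slope:.3f}, intercept={result.intercept:.3f}, R²={r_squared:.3f}")
```

With real columns, drop missing values first (for example with `dropna()`), because neither `linregress` nor `pearsonr` gives a meaningful result when the input contains NaN.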

### See the distribution of any one variable

We can also plot the numerical distribution of any single variable to understand its shape across its full range.

In [None]:
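# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
sns.histplot(df['rating'], bins=10, kde=True, stat='density')

# Optionally compute the median and quartiles (pandas skips missing values here)
median = df['rating'].median()
twentyfive = df['rating'].quantile(0.25)
seventyfive = df['rating'].quantile(0.75)

# Uncomment to draw the median as a dashed vertical line
#plt.axvline(median, color='b', linestyle='--')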

### Closer look at other comparisons

We can do jointplots for other metrics in our dataset.

In [None]:
sns.jointplot(x="calories", y="fat", data=df, kind="reg")

### Closer look at the outliers

We can take a closer look at those two outliers that have nearly no fat, but over 1000 calories.

In [None]:
df[(df['calories'] > 1000) & (df['fat'] < 5)]
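
The filter above combines two boolean masks. The sketch below shows the same pattern step by step on a tiny, hypothetical frame; the `demo` rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows: only "a" is both high-calorie and low-fat
demo = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "calories": [1200, 350, 1500, 1100],
    "fat": [2, 10, 80, None],
})

# Each comparison yields a boolean Series with one entry per row
high_cal = demo["calories"] > 1000
low_fat = demo["fat"] < 5   # a missing fat value compares as False

# & combines the masks element-wise; indexing keeps rows where both are True
outliers = demo[high_cal & low_fat]
print(outliers)
```

When the comparisons are written inline, each one needs its own parentheses, because `&` binds more tightly than `>` and `<` in Python.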