# Numpy and Pandas 

## Goals

- Introduce the numpy and pandas libraries.

Learning objectives:

      - NumPy arrays and their mathematical abilities
      
      - Importing data into pandas using CSVs
      
      - Slicing and filtering pandas dataframes
      
      - Cleaning data
      
      - Statistics and other math with pandas



- NumPy and pandas are two common Python libraries used for statistical analysis, data munging/wrangling/transformation, and other mathematical purposes.

- To put it simply, these libraries are your connection to the data. Pandas is the most important tool because you'll spend the most time and effort with it. 


### Numpy

NumPy has a wide ecosystem of functions and uses, but for the purposes of this course we will focus on arrays, NumPy's version of a list.

In [13]:
# Import library
import numpy as np

In [2]:
#Let's turn a list into an array

l = [3,2,6,7,9,1,2,-5]

array = np.array(l)

In [3]:
#Call it
array

array([ 3,  2,  6,  7,  9,  1,  2, -5])

In [6]:
#Call type
type(array)

numpy.ndarray

How arrays differ from lists

In [None]:
#Does this code work



In [None]:
#What about this?


In [9]:
#Multiply l by 2 
l * 2

[3, 2, 6, 7, 9, 1, 2, -5, 3, 2, 6, 7, 9, 1, 2, -5]

In [10]:
#Multiply array by 2 
array * 2

array([  6,   4,  12,  14,  18,   2,   4, -10])

NumPy arrays have mathematical abilities that lists don't have, which makes them easier to use for numerical work. Multiplying a list by 2 repeats the list, while multiplying an array by 2 doubles every element.

In [14]:
#Mean value
array.mean()

3.125

In [12]:
l.mean()

AttributeError: 'list' object has no attribute 'mean'
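To make the difference concrete, here is a minimal sketch that uses the same numbers as above. The variable names (`values`, `arr`, `repeated`, `doubled`) are only illustrative choices.

```python
import numpy as np

values = [3, 2, 6, 7, 9, 1, 2, -5]
arr = np.array(values)

repeated = values * 2   # list: the whole list is repeated
doubled = arr * 2       # array: every element is multiplied by 2

print(len(repeated))    # 16
print(doubled.tolist()) # [6, 4, 12, 14, 18, 2, 4, -10]
print(arr.mean())       # 3.125

# A list has no .mean(); convert it first if you need array methods
print(np.array(values).mean() == arr.mean())
```

If you only have a list, wrapping it in `np.array(...)` is usually the quickest way to get these methods.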

In [None]:
#Maximum value
array.max()

In [None]:
#Minimum value
array.min()

In [None]:
#Sum all values
array.sum()

In [None]:
#Find standard deviation
array.std()

In [None]:
#Type array.s and press Tab to see the attributes that start with "s" (for example sum, std, and shape)

In [None]:
#What happens when you do this
dir(array)

You can also call functions from the NumPy module itself

In [15]:
#Median
np.median(array)

2.5

In [16]:
#Square
np.square(array)

array([ 9,  4, 36, 49, 81,  1,  4, 25])

In [17]:
#Square root
np.sqrt(array)

  


array([ 1.73205081,  1.41421356,  2.44948974,  2.64575131,  3.        ,
        1.        ,  1.41421356,         nan])

In [18]:
# Absolute value
np.abs(array)

array([3, 2, 6, 7, 9, 1, 2, 5])
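The `nan` in the square-root output above comes from the negative value -5, which has no real square root. The sketch below is one way to see which entries are affected; the use of `np.errstate` to silence the warning is an optional extra, not something the lesson depends on.

```python
import numpy as np

array = np.array([3, 2, 6, 7, 9, 1, 2, -5])

# Silence the "invalid value" warning NumPy emits for negative inputs
with np.errstate(invalid="ignore"):
    roots = np.sqrt(array)

print(np.isnan(roots))          # True only where the input was negative
print(np.sqrt(np.abs(array)))   # taking the absolute value first avoids nan
```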

In [None]:
# dir(np)

Arrays can also be multi-dimensional

In [19]:
#Make a two-dimensional NumPy array with the arange and reshape functions

np.arange(16)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [20]:
#Reshape array
arr_2d = np.arange(16).reshape(4,4)
arr_2d

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

<b>Slicing a two-dimensional array</b>

In [None]:
#Slice rows
arr_2d[:2, :]

In [None]:
#Slice columns 
arr_2d[:,1:3]

In [None]:
#Slice both rows and columns
arr_2d[2:,1:3]

In [None]:
arr_2d

In [None]:
#Slice a single column (every row, column index 2)
arr_2d[:,2]
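The slicing pattern is always `array[rows, columns]`. As a rough sketch, here are the same slices again with the shapes and values they return (the variable names are made up for this example):

```python
import numpy as np

arr_2d = np.arange(16).reshape(4, 4)

first_two_rows = arr_2d[:2, :]    # rows 0-1, every column
middle_columns = arr_2d[:, 1:3]   # every row, columns 1-2
lower_block = arr_2d[2:, 1:3]     # rows 2-3, columns 1-2
third_column = arr_2d[:, 2]       # one column, returned as a 1-D array
single_value = arr_2d[1, 2]       # row 1, column 2

print(first_two_rows.shape, middle_columns.shape)  # (2, 4) (4, 2)
print(lower_block.tolist())                        # [[9, 10], [13, 14]]
print(third_column.tolist(), single_value)         # [2, 6, 10, 14] 6
```

Remember that the end of a slice is exclusive, so `1:3` means indexes 1 and 2.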

Fantastic numpy tutorial here: https://www.datacamp.com/community/tutorials/python-numpy-tutorial

## Pandas

From <u>[Mastering Pandas](https://www.packtpub.com/big-data-and-business-intelligence/mastering-pandas)</u>

    The pandas is a high-performance open source library for data analysis in Python developed by Wes McKinney in 2008. Over the years, it has become the de-facto standard library for data analysis using Python. There's been great adoption of the tool, a large community behind it, (220+ contributors and 9000+ commits by 03/2014), rapid iteration, features, and enhancements continuously made.
    
    • It can process a variety of data sets in different formats: time series, tabular heterogeneous, and matrix data.
    • It facilitates loading/importing data from varied sources such as CSV and DB/SQL.
    • It can handle a myriad of operations on data sets: subsetting, slicing, filtering, merging, groupBy, re-ordering, and re-shaping.
    • It can deal with missing data according to rules defined by the user/ developer: ignore, convert to 0, and so on.
    • It can be used for parsing and munging (conversion) of data as well as modeling and statistical analysis.
    • It integrates well with other Python libraries such as statsmodels, SciPy, and scikit-learn.



In [21]:
#Import pandas library
import pandas as pd

In [22]:
#Create a pandas series.
series = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
series

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [23]:
#Return row
series["c"]

0.75

In [None]:
#Change value in the series
series["d"] = 2.0
series

In [None]:
#Turn python dictionary into pandas series
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

In [None]:
#Call California
population["California"]
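A series behaves a bit like a dictionary and a bit like an array. The sketch below uses three of the states from above; note that slicing by label includes the end label, unlike slicing a list.

```python
import pandas as pd

population = pd.Series({"California": 38332521,
                        "Texas": 26448193,
                        "New York": 19651127})

print("Texas" in population)              # membership checks the index labels
label_slice = population["California":"Texas"]
print(label_slice)                        # both California and Texas are returned

big_states = population[population > 20000000]  # boolean filtering, as with arrays
print(big_states.index.tolist())
```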

In [25]:
#Turn python dictionary into pandas data frame

data = {"feature_one" :[1,2,4,8,-3],
       "feature_two" : ["haight", "mission", "geary", "castro", "potrero"],
       "feature_three": [True, True, False, True, False]}
df = pd.DataFrame(data, index=[6,7,8,9,10])

df

    feature_one  feature_three  feature_two
6             1           True       haight
7             2           True      mission
8             4          False        geary
9             8           True       castro
10           -3          False      potrero


In [26]:
#Returns columns
df.columns

Index([u'feature_one', u'feature_three', u'feature_two'], dtype='object')

In [27]:
#Returns index
df.index

Int64Index([6, 7, 8, 9, 10], dtype='int64')

In [None]:
#Returns numpy array version of data frame. Also works on series.
df.values

In [28]:
#Call type on df
type(df)

pandas.core.frame.DataFrame

In [29]:
#Call feature_one column
f1 = df["feature_one"]
f1

6     1
7     2
8     4
9     8
10   -3
Name: feature_one, dtype: int64

In [30]:
#df.feature_one also works
df.feature_one

6     1
7     2
8     4
9     8
10   -3
Name: feature_one, dtype: int64

In [31]:
#Call type on f1
type(f1)

pandas.core.series.Series

In [32]:
#Select columns 
cols = ["feature_one", "feature_two"]
dff = df[cols]
dff

    feature_one  feature_two
6             1       haight
7             2      mission
8             4        geary
9             8       castro
10           -3      potrero


In [33]:
#Add new column to dataset

#Create new column feature_four by assigning it to number 4
df["feature_four"] = 4

#Create new column feature_five by assigning it to list ds
ds = ["Data", "Science", "Math", "Programming", "Hacking"]
df["feature_five"] = ds
df

    feature_one  feature_three  feature_two  feature_four  feature_five
6             1           True       haight             4          Data
7             2           True      mission             4       Science
8             4          False        geary             4          Math
9             8           True       castro             4   Programming
10           -3          False      potrero             4       Hacking


The first dataset we will work with is the drinks dataset

In [35]:
#File location of drinks dataset
path = "../../data/drinks.csv"

drinks = pd.read_csv(path)

In [36]:
#Head is used to view first 5 rows. 5 is default but can be changed.

drinks.head()

       country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0  Afghanistan              0                0              0                           0.0         AS
1      Albania             89              132             54                           4.9         EU
2      Algeria             25                0             14                           0.7         AF
3      Andorra            245              138            312                          12.4         EU
4       Angola            217               57             45                           5.9         AF


In [37]:
#Tail is for last five rows
drinks.tail()

       country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
188  Venezuela            333              100              3                           7.7         SA
189    Vietnam            111                2              1                           2.0         AS
190      Yemen              6                0              0                           0.1         AS
191     Zambia             32               19              4                           2.5         AF
192   Zimbabwe             64               18              4                           4.7         AF


In [38]:
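#Let's designate the country column as the index
#set_index returns a new data frame; drinks itself stays unchanged unless you reassign it or pass inplace=True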
drinks.set_index("country")


                   beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
country
Afghanistan                    0                0              0                           0.0         AS
Albania                       89              132             54                           4.9         EU
Algeria                       25                0             14                           0.7         AF
Andorra                      245              138            312                          12.4         EU
Angola                       217               57             45                           5.9         AF
Antigua & Barbuda            102              128             45                           4.9        NaN
Argentina                    193               25            221                           8.3         SA
Armenia                       21              179             11                           3.8         EU
Australia                    261               72            212                          10.4         OC
Austria                      279               75            191                           9.7         EU


In [39]:
drinks.head()

       country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0  Afghanistan              0                0              0                           0.0         AS
1      Albania             89              132             54                           4.9         EU
2      Algeria             25                0             14                           0.7         AF
3      Andorra            245              138            312                          12.4         EU
4       Angola            217               57             45                           5.9         AF
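Notice that `drinks.head()` still shows the integer index: `set_index` handed back a new data frame and left `drinks` alone. Here is a small sketch of that behavior on the first five rows shown above (only three columns are copied over, to keep it short).

```python
import pandas as pd

# First five rows of the drinks data, copied from the head() output
drinks = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "beer_servings": [0, 89, 25, 245, 217],
    "wine_servings": [0, 54, 14, 312, 45],
})

indexed = drinks.set_index("country")   # returns a new data frame
original_index = drinks.index.tolist()  # drinks keeps its integer index
print(original_index)
print(indexed.index.tolist())

# To keep the change, reassign the result (or pass inplace=True)
drinks = drinks.set_index("country")
print(drinks.loc["Albania", "beer_servings"])  # 89
```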


In [40]:
#How many rows and columns are there in this dataset?
drinks.shape

(193, 6)

General dataset information

In [41]:
#Let's look at some details of this dataset
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       170 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.1+ KB


In [42]:
drinks.describe()

       beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
count          193.0            193.0          193.0                         193.0
mean      106.160622        80.994819      49.450777                      4.717098
std       101.143103        88.284312      79.697598                      3.773298
min              0.0              0.0            0.0                           0.0
25%             20.0              4.0            1.0                           1.3
50%             76.0             56.0            8.0                           4.2
75%            188.0            128.0           59.0                           7.2
max            376.0            438.0          370.0                          14.4


In [43]:
#On pandas 2.0 or later, use drinks.corr(numeric_only=True) so the text columns are skipped
drinks.corr()

                              beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
beer_servings                           1.0         0.458819       0.527172                      0.835839
spirit_servings                    0.458819              1.0       0.194797                      0.654968
wine_servings                      0.527172         0.194797            1.0                      0.667598
total_litres_of_pure_alcohol       0.835839         0.654968       0.667598                           1.0


What do these two commands do?

## Slicing dataframes

.loc

In [49]:
# Select values in Peru row
# .loc selects by index label, so first make a copy of drinks that is indexed by country
by_country = drinks.set_index("country")
by_country.loc["Peru"]

In [None]:
#Select values in wine_servings column
# drinks.loc[:,"wine_servings"]

In [47]:
#Slice countries Germany to Guyana and columns beer_servings to wine_servings
by_country.loc["Germany": "Guyana", "beer_servings": "wine_servings"]


.iloc

In [None]:
#What do you think iloc does???

In [48]:
drinks.iloc[130:135, :2]

              country  beer_servings
130            Panama            285
131  Papua New Guinea             44
132          Paraguay            213
133              Peru            163
134       Philippines             71


In [50]:
#Returns row at index 48
drinks.iloc[48]

country                         Denmark
beer_servings                       224
spirit_servings                      81
wine_servings                       278
total_litres_of_pure_alcohol       10.4
continent                            EU
Name: 48, dtype: object

In [None]:
#Returns column at index 1
drinks.iloc[:, 1].head()

In [None]:
#Return slice of rows and columns
drinks.iloc[100:134, :-2]

In [None]:
drinks.iloc[:,[0,4]]

.iloc slices data frames by integer position, while .loc slices them by index label.
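A quick side-by-side may help. This sketch rebuilds the first five drinks rows with country as the index (the name `by_country` is just a label chosen for this example) and shows that a `.loc` slice includes its end label while an `.iloc` slice excludes its end position.

```python
import pandas as pd

by_country = pd.DataFrame(
    {"beer_servings": [0, 89, 25, 245, 217],
     "spirit_servings": [0, 132, 0, 138, 57],
     "wine_servings": [0, 54, 14, 312, 45]},
    index=["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
)

# Label-based: Albania through Andorra, beer through spirit (ends included)
by_label = by_country.loc["Albania":"Andorra", "beer_servings":"spirit_servings"]

# Position-based: rows 1-2 and column 0 (ends excluded)
by_position = by_country.iloc[1:3, 0:1]

print(by_label.shape, by_position.shape)   # (3, 2) (2, 1)
print(by_country.loc["Angola", "wine_servings"], by_country.iloc[4, 2])  # 45 45
```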

<b>Conditional selection</b>

What happens when you run the next two cells?

In [51]:
drinks.continent == "EU"

0      False
1       True
2      False
3       True
4      False
5      False
6      False
7       True
8      False
9       True
10      True
11     False
12     False
13     False
14     False
15      True
16      True
17     False
18     False
19     False
20     False
21      True
22     False
23     False
24     False
25      True
26     False
27     False
28     False
29     False
       ...  
163    False
164    False
165     True
166     True
167    False
168    False
169    False
170     True
171    False
172    False
173    False
174    False
175    False
176    False
177    False
178    False
179    False
180     True
181    False
182     True
183    False
184    False
185    False
186    False
187    False
188    False
189    False
190    False
191    False
192    False
Name: continent, Length: 193, dtype: bool

In [None]:
drinks.wine_servings > 20

Take those commands and pass them into the drinks data frame

In [None]:
drinks[drinks.continent == "EU"]

In [None]:
#Rows where wine_servings greater than 20
drinks[drinks.wine_servings > 20]

In [None]:
#What if we want non-European countries?
drinks[drinks.continent != "EU"]

`drinks.continent=='EU'` by itself returns a bunch of Trues and Falses.

When you wrap drinks around it with square brackets you're telling the drinks dataframe to select only those that are True, and not the False ones.
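Here is the same idea on a tiny frame, as a sketch. The rows are the first five from the drinks data; `is_eu` is a made-up name for the True/False series, often called a mask.

```python
import pandas as pd

drinks = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "wine_servings": [0, 54, 14, 312, 45],
    "continent": ["AS", "EU", "AF", "EU", "AF"],
})

is_eu = drinks.continent == "EU"     # one True/False per row
print(is_eu.tolist())                # [False, True, False, True, False]

eu_rows = drinks[is_eu]              # keeps only the rows where the mask is True
print(eu_rows.country.tolist())      # ['Albania', 'Andorra']

print(is_eu.sum())                   # True counts as 1, so this counts the EU rows
```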


In [None]:
#Return a data frame where both conditions are true
drinks[(drinks.continent == "EU") & (drinks.wine_servings > 20)]

In [None]:
#Return data frame where either condition is true
drinks[(drinks.continent == "EU") | (drinks.wine_servings > 20)].shape

In [None]:
#Return rows where wine_servings is greater than beer_servings
drinks[drinks.wine_servings > drinks.beer_servings]

In [None]:
#Select the country column to return just the countries
drinks[drinks.wine_servings > drinks.beer_servings].country

We can sum boolean values.

In [None]:
#How many countries consume no beer at all?
(drinks.beer_servings == 0).sum()

In [None]:
# drinks.to_csv("drinks2.csv")

<b>Pandas Series</b>

In [None]:
#Assign beer_servings to variable beer
beer = drinks["beer_servings"]
beer.head()

In [None]:
#Can do math operations similar to numpy arrays
#Multiply every value in beer by 2
beer*2

In [None]:
#Add 2 to every value in beer
beer + 2

In [None]:
#Derive mean of beer
beer.mean()

In [None]:
#Derive median of beer
beer.median()

In [None]:
#Sum all values in beer
beer.sum()

In [None]:
#Pandas series can be added to one another
wine = drinks.wine_servings
beer + wine

In [None]:
#Create a new column called total_servings that is the sum of the beer, wine, and spirits columns
#Note the misspelled column name below; it is corrected later with rename
drinks["total_servivings"] = beer + wine + drinks.spirit_servings
drinks.head()

In [None]:
#Type pd.read_sq and press Tab to see the SQL readers (for example pd.read_sql)

In [None]:
#Let's take a look at continent
cont = drinks.continent

In [None]:
#How many null values are there in continent
#First check to see which values are null
cont.isnull().sum()

In [None]:
cont.head(10)

In [None]:
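#Replace every null with "No Continent"
#Assign the result back to drinks so the data frame itself is updated, then refresh cont
drinks["continent"] = drinks.continent.fillna("No Continent")
cont = drinks.continent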

In [None]:
cont.isnull().sum()

`.fillna()` is also great for replacing all the null values in a numerical column with the mean of that column

In [None]:
#Drop every null value in cont
#cont.dropna(inplace=True)

`.isnull()`, `.fillna()`, and `.dropna()` work with data frames as well
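As a rough sketch of both points, the toy frame below (made-up values, hypothetical column names `score` and `group`) fills a numeric column with its own mean and then drops the rows that still contain a null.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
df = pd.DataFrame({"score": [1.0, np.nan, 3.0, np.nan],
                   "group": ["a", "b", None, "d"]})

null_counts = df.isnull().sum()      # nulls per column
print(null_counts)

# Fill a numeric column with its own mean (the mean skips nulls)
df["score"] = df["score"].fillna(df["score"].mean())
print(df["score"].tolist())          # [1.0, 2.0, 3.0, 2.0]

remaining = df.dropna()              # drops the row whose group is still null
print(remaining.shape)               # (3, 2)
```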

In [None]:
#What are the continents in cont?
cont.unique()

In [None]:
#How many unique values there are
cont.nunique()

In [None]:
#How many countries are from each continent?
cont.value_counts()

In [None]:
#What percentage of the data belongs to each continent
cont.value_counts(normalize=True)

Let's go back to the drinks data frame

In [None]:
drinks.columns

In [None]:
drinks.rename(columns={"total_servivings":"total_servings"}, inplace=True)

In [None]:
#Show me the top 5 booziest countries
drinks.sort_values(by="total_servings", ascending=False).head()

Does that seem right?

We're forgetting something

In [None]:
#Select the country column to return just the country names
drinks.sort_values(by="total_servings", ascending=False).head().country

sort_values defaults to ascending order (least to greatest), which is why ascending=False is passed above.

In [None]:
#This also works (kinda): tail() on an ascending sort returns the largest five, listed smallest first
drinks.sort_values(by="total_servings").tail().country

In [None]:
#Sort values in a series
# beer.sort_values(ascending=False)

### Exercise time

1. Which countries drink more spirits than beer?

2. Find the top five booziest countries of each continent, and do not include "No Continent". Try using a dictionary and a for loop.

In [None]:
#Answer 1.
drinks[drinks.spirit_servings > drinks.beer_servings].country


In [None]:
#Answer 2.

top5_cont = {}

for i in drinks.continent.unique():
    if i != "No Continent":
        cont_df = drinks[drinks.continent == i]
        top5 = cont_df.sort_values(by="total_servings", ascending=False).head().country.tolist()
        top5_cont[i] = top5
top5_cont
        

## Groupby

**Split Apply Combine**

<img src="https://www.safaribooksonline.com/library/view/learning-pandas/9781783985128/graphics/5128OS_09_01.jpg">

In [None]:
#Group by continent
drinks.groupby("continent")

In [None]:
#Data needs to be accessed by certain methods
#Call .mean()
#On pandas 2.0 or later, pass numeric_only=True so the text column (country) is skipped
drinks.groupby("continent").mean()

In [None]:
#Call .median()
drinks.groupby("continent").median()

In [52]:
#What happens when you do .describe()
drinks.groupby("continent").describe()

beer_servings
continent  count        mean        std   min    25%    50%     75%    max
AF          53.0   61.471698  80.557816   0.0   15.0   32.0    76.0  376.0
AS          44.0   37.045455  49.469725   0.0   4.25   17.5    60.5  247.0
EU          45.0  193.777778  99.631569   0.0  127.0  219.0   270.0  361.0
OC          16.0     89.6875  96.641412   0.0   21.0   52.5  125.75  306.0
SA          12.0  175.083333  65.242845  93.0  129.5  162.5   198.0  333.0

spirit_servings
continent  count        mean  ...
AF          53.0   16.339623  ...
AS          44.0   60.840909  ...
EU          45.0  132.555556  ...
OC          16.0     58.4375  ...
SA          12.0      114.75  ...

total_litres_of_pure_alcohol
continent  ...    75%   max
AF         ...    4.7   9.1
AS         ...  2.425  11.5
EU         ...   10.9  14.4
OC         ...   6.15  10.4
SA         ...  7.375   8.3

wine_servings
continent  count        mean        std  min   25%    50%    75%    max
AF          53.0   16.264151  38.846419  0.0   1.0    2.0   13.0  233.0
AS          44.0    9.068182  21.667034  0.0   0.0    1.0    8.0  123.0
EU          45.0  142.222222  97.421738  0.0  59.0  128.0  195.0  370.0
OC          16.0      35.625   64.55579  0.0   1.0    8.5  23.25  212.0
SA          12.0   62.416667  88.620189  1.0   3.0   12.0   98.5  221.0


In [54]:
#Call specific column on groupby object and count the non-null values
drinks.groupby("continent")["beer_servings"].count()

continent
AF    53
AS    44
EU    45
OC    16
SA    12
Name: beer_servings, dtype: int64

In [None]:
#Find the max
drinks.groupby("continent")["beer_servings"].max()
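To see what "split, apply, combine" means in practice, the sketch below does the three steps by hand with a for loop and then compares the result with `groupby`. It uses the first five drinks rows; `manual` and `grouped` are names made up for this example.

```python
import pandas as pd

drinks = pd.DataFrame({
    "continent": ["AS", "EU", "AF", "EU", "AF"],
    "beer_servings": [0, 89, 25, 245, 217],
})

# Split: one sub-frame per continent. Apply: take the mean. Combine: collect the results.
manual = {}
for continent in drinks.continent.unique():
    subset = drinks[drinks.continent == continent]
    manual[continent] = subset.beer_servings.mean()

# groupby performs the same three steps in one line
grouped = drinks.groupby("continent")["beer_servings"].mean()

print(manual)
print(grouped)
```

Both approaches give the same averages; `groupby` is simply shorter and returns the result as a series sorted by group name.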

End of drinks data. Any questions before we move on?

In [None]:
#Read in chipotle dataset

path = "../data/chipotle.tsv"
#Use read_table instead of read_csv because the file is tab-separated
chip = pd.read_table(path)
chip.head()

We can see that there are some NaN present in the data set. Let's look at how many there are in each column.

In [None]:
#Call .isnull() and then call .sum()
chip.isnull().sum()

In [None]:
#What happens when you tack on another .sum()?
chip.isnull().sum().sum()

We need to fix the price column before converting it to a float

In [None]:
#What is item_price type?
chip.item_price.dtype

Pandas series have a string accessor (`.str`) that lets you apply string methods to every value in a column

In [None]:
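#Call .str
#regex=False treats "$" as a literal dollar sign instead of a regular-expression anchor
chip.item_price.str.replace("$", "", regex=False)

In [None]:
#Replace $ with empty string and overwrite item_price column
chip["item_price"] = chip.item_price.str.replace("$", "", regex=False)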

In [None]:
#Change the type of column from object to float and overwrite item_price column
chip["item_price"] = chip.item_price.astype(float)

In [None]:
chip.head()

In [None]:
chip.dtypes
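The two cleaning steps can be tried out on a small series first. This is only a sketch: the prices below are hypothetical stand-ins written in the same dollar-sign text format, not values taken from the Chipotle file.

```python
import pandas as pd

# Hypothetical prices stored as text, like the item_price column
prices = pd.Series(["$2.39", "$3.39", "$16.98"])

cleaned = prices.str.replace("$", "", regex=False)  # literal "$", not a regex
as_float = cleaned.astype(float)                    # text -> numbers

print(cleaned.tolist())
print(as_float.dtype)
print(as_float.max())
```

Once the column is numeric, methods such as `.mean()` and `.sum()` work on it.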

More examples of using the `.str` accessor

In [None]:
chip.item_name.str.capitalize()

In [None]:
chip.item_name.str.lower()

In [None]:
chip.item_name.str.len()

We know how to drop rows with null values but how do we drop columns with null values?

In [None]:
#Drop columns with null values
# chip.dropna(axis= 1)

Remember `chip.isnull().sum()`? 

In [None]:
#Set axis = 1 in .sum()
chip.isnull().sum(axis=1).sum()

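Axis refers to the direction a command works along. axis=0 refers to rows and axis=1 refers to columns; 0 is the default. For example, `.sum()` with axis=0 adds down the rows and returns one total per column, while axis=1 adds across the columns and returns one total per row.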

This is handy when it comes to dropping rows and columns

In [None]:
#Let's get rid of the item_name column

chip.drop("item_name")

Why didn't this work?

Forgot to do axis = 1 !!

Let's try it again

In [None]:
chip.drop("item_name", axis=1)

In [None]:
#Make it permanent
# chip.drop("item_name", axis = 1, inplace=True)

If you want to drop rows, pass an index label or a list of index labels

In [None]:
chip.drop([0, 2, 4, 6, 8])
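Here is a small sketch that puts the axis idea and both kinds of drop together. The frame is a hypothetical stand-in for the Chipotle data, with made-up values and a made-up `choice` column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Chipotle data
chip = pd.DataFrame({"item_name": ["Chips", "Izze", "Bowl"],
                     "choice": [np.nan, "Clementine", np.nan],
                     "quantity": [1, 1, 2]})

per_column = chip.isnull().sum()        # axis=0 (default): null count per column
per_row = chip.isnull().sum(axis=1)     # axis=1: null count per row
print(per_column)
print(per_row.tolist())                 # [1, 0, 1]

no_name = chip.drop("item_name", axis=1)   # drop a column
fewer_rows = chip.drop([0, 2])             # drop rows by index label
print(no_name.columns.tolist(), fewer_rows.index.tolist())

# Neither drop changed chip itself
print(chip.shape)                          # (3, 3)
```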

Move on from Chipotle to Movie metadata

In [None]:
#Load in movie_metadata dataset
path = "../data/movie_metadata.csv"

movies = pd.read_csv(path)
movies.head()

In [None]:
movies.shape

In [None]:
movies.columns

In [None]:
#Check out the info
movies.info()

In [None]:
#Replace nulls in budget with median of budget
#Assigning the result back updates the column without relying on inplace
movies["budget"] = movies.budget.fillna(movies.budget.median())

In [None]:
#Drop nulls permanently
movies.dropna(inplace=True)

In [None]:
#How many rows are there after dropping nulls?
movies.shape

Filter out some data. Let's only look at movies that are rated G, PG, PG-13, or R.
We could do it this way: movies[(movies.content_rating=="G") | (movies.content_rating=="PG") | ...] and so on for every rating.

But let's not!

In [None]:
#List called ratings with the four ratings we want to use
ratings = ["G", "PG", "PG-13", "R"]

In [None]:
movies.content_rating.unique()

In [None]:
#Now let's check to see if a value in content_rating "is in" ratings
movies[movies.content_rating.isin(ratings)]

In [None]:
#Return new data frame with movies with a rating that is in ratings
#.copy() avoids a SettingWithCopyWarning when new columns are added later
movies = movies[movies.content_rating.isin(ratings)].copy()

In [None]:
#check shape
movies.shape
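To check that `.isin()` really does the same job as a chain of conditions, here is a sketch on a made-up ratings column (the six ratings below are illustrative, not taken from the movie file).

```python
import pandas as pd

# Hypothetical ratings column for illustration
movies = pd.DataFrame({"content_rating": ["G", "R", "NC-17", "PG-13", "Unrated", "PG"]})
ratings = ["G", "PG", "PG-13", "R"]

with_isin = movies[movies.content_rating.isin(ratings)]
with_ors = movies[(movies.content_rating == "G") | (movies.content_rating == "PG")
                  | (movies.content_rating == "PG-13") | (movies.content_rating == "R")]

print(with_isin.equals(with_ors))           # True
print(with_isin.content_rating.tolist())    # ['G', 'R', 'PG-13', 'PG']
```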

## Mapping and applying functions

In [None]:
#Make a new column called profit by subtracting budget from gross
movies["profit"] = movies.gross - movies.budget
movies.profit.head()

We want to know if a movie is profitable. How can we go about doing that?
With some python!

In [None]:
#Make a function that returns "profitable" for profit > 0 and "loss" for profit <= 0
def profit_decider(x):
    if x > 0:
        return "profitable"
    else:
        return "loss"

In [None]:
#Apply the function onto the profit column
# movies.profit.apply(profit_decider)

In [None]:
#Make new column called "profitable"
movies["profitable"] = movies.profit.apply(profit_decider)

In [None]:
#How many movies are profitable and not profitable
movies.profitable.value_counts()
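If you want to see what `.apply()` is doing, try it on a handful of numbers first. The sketch below reuses `profit_decider`; the profit figures are made up for illustration.

```python
import pandas as pd

def profit_decider(x):
    # Same rule as above: positive profit is "profitable", anything else is "loss"
    if x > 0:
        return "profitable"
    else:
        return "loss"

# Hypothetical profit figures for illustration
profit = pd.Series([1500000, -200000, 0, 42])

labels = profit.apply(profit_decider)   # the function is called once per value
print(labels.tolist())                  # ['profitable', 'loss', 'loss', 'profitable']
print(labels.value_counts())
```

Note that a profit of exactly 0 falls into the "loss" group because the test is `x > 0`.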

Use lambda functions

In [None]:
#Return 1 for movies directed by Christopher Nolan
# movies.director_name.apply(lambda x: 1 if x == "Christopher Nolan" else 0)

`.apply()` allows us to transform values based on rules set by a function.

`.map()` allows us to use a dictionary to directly change values

In [None]:
#Example of changing male to m and female to f (gender is a small made-up series)
gender = pd.Series(["male", "female", "female"])
gender.map({"male":"m", "female":"f"})

A parent wants to make a new column called kid_friendly, where movies rated G or PG are labelled "KF" and movies rated PG-13 or R are labelled "NKF"

In [None]:
#Map a dictionary onto the content_rating column and create new column called "kid_friendly"
kf_dict = {"G" : "KF",
          "PG" : "KF",
          "PG-13" : "NKF",
          "R" : "NKF"}


movies["kid_friendly"] = movies.content_rating.map(kf_dict)

In [None]:
movies.kid_friendly.value_counts()

Remember that you must account for every unique value in the dictionary, or the values you leave out will become null.
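Here is a sketch of what that looks like. The ratings series is made up and deliberately includes one rating ("NC-17") that the dictionary does not cover; filling the gap with "unknown" is just one possible way to handle it.

```python
import pandas as pd

# Hypothetical ratings, including one the dictionary does not cover
content_rating = pd.Series(["G", "PG-13", "NC-17", "R"])
kf_dict = {"G": "KF", "PG": "KF", "PG-13": "NKF", "R": "NKF"}

kid_friendly = content_rating.map(kf_dict)
print(kid_friendly.tolist())          # the unmapped rating becomes NaN
print(kid_friendly.isnull().sum())    # 1

# One way to handle leftovers is to fill them with an explicit label
filled = kid_friendly.fillna("unknown")
print(filled.tolist())                # ['KF', 'NKF', 'unknown', 'NKF']
```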

## Class work exercises

1) What percent of movies are directed by James Cameron?

2) What are the correlations among budget, gross, and imdb_score?

3) How many PG-13 movies has Robert De Niro starred in?

4) How much money have non-English films generated?

5) Who are the top five grossing directors on average?

5b) Now only look at directors who have directed more than five films

6) How many movies contain "Action" in the genre column? What about Comedy? What about both Comedy and Action?

If you finish the exercises before the end of class, then further investigate the movies dataset on your own or any of the other sets in the data folder.

Be sure to check out any of these pandas resources

- Pandas cheatsheets in the extracurricular directory

- https://chrisalbon.com/ <- Great website. For now check out data wrangling section.

- Data school's collection of resources http://www.dataschool.io/best-python-pandas-resources/

- Data school tutorial in giant repo. http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb

- http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
- http://nbviewer.jupyter.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_1-Introduction-to-Pandas.ipynb

- Repo with Pandas exercises https://github.com/guipsamora/pandas_exercises

- https://github.com/brandon-rhodes/pycon-pandas-tutorial

- https://github.com/jonathanrocher/pandas_tutorial

- https://github.com/chendaniely/scipy-2017-tutorial-pandas

- https://github.com/adeshpande3/Pandas-Tutorial



YouTube is your friend!! There are too many pandas tutorial videos to count.

<b>If you find a good resource, be sure to share it with the rest of the class</b>