# Pandas 101

Pandas is a popular Python library that is used extensively for data manipulation and analysis.

It can take data in many forms: lists (or lists of lists), dictionaries, and external files on your computer such as Excel workbooks and CSVs.

While learning Pandas, we will also introduce another library, NumPy, which provides many numerical functions for working with datasets.

**A Word on Libraries**

A library is a set of functions pre-written by someone else that can be 'called' and used for their pre-defined purpose. Most of what you have used and learnt so far belongs to the Python Standard Library, which comes pre-packaged with Python. Pandas is an example of a third-party library: once it is installed, it can be imported so that its predefined functions do the work with a lot less code.

## Importing Pandas

In [4]:
# Let's first import pandas

import pandas as pd # We rename pandas to pd so we don't have to write pandas. before every function that we call from Pandas

## Making a DataFrame...

A DataFrame object is basically a table. It looks and behaves like a standard Excel or SQL table. The magic is in taking different types of data and turning them into tables so that the data is easier for humans to read and use.

### ...From a List

In [5]:
dataList = [[1251,65342,164323,'What'],[635,421,1125,'Who'],[12412,8585,85434,'Where']]

df = pd.DataFrame(dataList)

In [6]:
df.head() # Displaying your list as a DataFrame


Unnamed: 0,0,1,2,3
0,1251,65342,164323,What
1,635,421,1125,Who
2,12412,8585,85434,Where


In [7]:
# Things to look out for: keep all sublists the same length. Pandas pads shorter sublists with missing values (NaN) instead of raising an error
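
A quick way to see why equal-length sublists matter is to pass pandas a ragged list on purpose. The sketch below uses made-up numbers (the name `ragged` is hypothetical) and assumes a reasonably recent pandas version, where the shorter sublist is padded rather than rejected.

```python
import pandas as pd

# Hypothetical data: the second sublist is one item shorter than the first
ragged = [[1, 2, 3], [4, 5]]

ragged_df = pd.DataFrame(ragged)
print(ragged_df)

# Count the cells that pandas had to fill in as missing
missing_cells = int(ragged_df.isna().sum().sum())
print(missing_cells)
```

A silent NaN like this is easy to miss in a large dataset, which is why it is worth checking sublist lengths before building the DataFrame.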

In [8]:
# Our column names aren't very descriptive though? What can we do?
# Let's learn how to change them

colNames = ['A','B','C','D']

df.columns = colNames
df.head() #Voila!

Unnamed: 0,A,B,C,D
0,1251,65342,164323,What
1,635,421,1125,Who
2,12412,8585,85434,Where


### ...From a Dictionary

Let's say you have a dictionary that maps dates to another value, say the average temperature for that day.

In [9]:
tempDict = {'2017-09-10':35.5,'2017-09-11':33.3,'2017-09-12':31.0,'2017-09-13':34.6}

tempDF = pd.DataFrame(list(tempDict.items()),columns = ['Date','Average Temperature'])

# Notice that all you're really doing is extracting the key/value pairs and turning them into lists before passing into pandas
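
The key/value extraction can be written in a couple of other ways that give the same two-column table. This is only a sketch: the names `temp_by_date`, `from_columns` and `from_series` are made up for illustration, and the values are the first two entries of the dictionary above.

```python
import pandas as pd

temp_by_date = {'2017-09-10': 35.5, '2017-09-11': 33.3}

# Option 1: a dictionary of columns, where each key is a column name
from_columns = pd.DataFrame({
    'Date': list(temp_by_date.keys()),
    'Average Temperature': list(temp_by_date.values()),
})

# Option 2: a Series indexed by date, then the index moved into a regular column
from_series = (
    pd.Series(temp_by_date, name='Average Temperature')
    .rename_axis('Date')
    .reset_index()
)

print(from_columns)
print(from_series)
```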

In [14]:
tempDF.head(1)

Unnamed: 0,Date,Average Temperature
0,2017-09-10,35.5


### ...From an Excel File

This is likely to be the most useful option in real life. You have a dataset in Excel somewhere, and you want to use Python to analyse it.

In [16]:
cars = pd.read_excel('cars.xls') # If the file is in the same folder as your IPython notebook, you just need the name
# Otherwise, provide the full path to the file, e.g. starting from the C: drive on Windows, /Users/ on macOS or /home/ on Linux

cars.head()

!cd

C:\work-py\workspace-01\machine-learning\master\101-foundation\W1.2 Pandas101


### ...From a CSV File

A CSV file is a comma-separated values file and looks like this:

ColA,ColB,ColC,ColD
<br>
1325,3115,6435,1526

Every column is separated by a comma, and every row is separated by a new line.

This is probably the most common form of dataset that you will come across, both in this class and in real life.

**Note:** The column separator doesn't have to be a comma; it can be any character. But remember that commonly occurring characters are a bad idea: they can also appear inside normal text data, and in that case a single value may be split across several columns.
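
To make the separator point concrete, here is a small sketch that reads pipe-separated text with the `sep` argument of `pd.read_csv`. The two-row dataset is invented for illustration and is read from an in-memory string, so no file is needed.

```python
import io
import pandas as pd

# Hypothetical pipe-separated data; note the comma inside the first title
raw = "title|year\nThe Good, the Bad and the Ugly|1966\nAvatar|2009\n"

pipe_df = pd.read_csv(io.StringIO(raw), sep='|')
print(pipe_df)
print(pipe_df.shape)
```

Because the separator is `|`, the comma in the title stays part of the text instead of starting a new column.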

In [17]:
mov = pd.read_csv('movie_metadata.csv')
mov.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


### Data Exploration

Now that we know how to import datasets, let's learn some very useful Pandas functions.

In [None]:
# the head function that outputs the first 5 rows

mov.head()

In [47]:
# You can put in a number inside the brackets after head to see more rows

mov.head(20)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
6,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,336530303.0,Action|Adventure|Romance,...,1902.0,English,USA,PG-13,258000000.0,2007.0,11000.0,6.2,2.35,0
7,Color,Nathan Greno,324.0,100.0,15.0,284.0,Donna Murphy,799.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,...,387.0,English,USA,PG,260000000.0,2010.0,553.0,7.8,1.85,29000
8,Color,Joss Whedon,635.0,141.0,0.0,19000.0,Robert Downey Jr.,26000.0,458991599.0,Action|Adventure|Sci-Fi,...,1117.0,English,USA,PG-13,250000000.0,2015.0,21000.0,7.5,2.35,118000
9,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,301956980.0,Adventure|Family|Fantasy|Mystery,...,973.0,English,UK,PG,250000000.0,2009.0,11000.0,7.5,2.35,10000
10,Color,Zack Snyder,673.0,183.0,0.0,2000.0,Lauren Cohan,15000.0,330249062.0,Action|Adventure|Sci-Fi,...,3018.0,English,USA,PG-13,250000000.0,2016.0,4000.0,6.9,2.35,197000


In [19]:
# To check the bottom of the dataset, use the tail function. 
# It works exactly the same way as the head function

mov.tail()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
5038,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
5039,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.0,32000
5040,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
5041,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660
5042,Color,Jon Gunn,43.0,90.0,16.0,16.0,Brian Herzlinger,86.0,85222.0,Documentary,...,84.0,English,USA,PG,1100.0,2004.0,23.0,6.6,1.85,456


In [18]:
# Check the dimensions of the dataset by using the shape attribute (no brackets needed)

mov.shape  # Our dataset has 5043 rows and 28 columns

(5043, 28)

In [30]:
# The numbers you see in the leftmost column, which has no column header, are called the index
# It helps make rows unique and goes up in increments of 1 for an untampered dataset
# A good way to tell if rows have been deleted during your analysis (by mistake or not) is to check for gaps in the increments

mov.index
type(mov.index)
pd.RangeIndex?  # IPython help lookup; RangeIndex lives in the pandas namespace, so it needs the pd. prefix

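
The idea of spotting deleted rows through gaps in the index can be sketched on a tiny made-up DataFrame (the names `df`, `trimmed` and `renumbered` are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 20, 30, 40]})   # hypothetical data
print(df.index)                    # a RangeIndex running from 0 to 3

trimmed = df.drop(index=[1])       # remove the row labelled 1
print(list(trimmed.index))         # [0, 2, 3] -> the gap shows a row is gone

renumbered = trimmed.reset_index(drop=True)
print(list(renumbered.index))      # [0, 1, 2] -> labels are consecutive again
```

`reset_index(drop=True)` discards the old labels; leave out `drop=True` if you want to keep them as a column.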


### Getting Familiar with your Dataset

A first step in exploratory analysis is to get familiar with the dataset.

1) What are the dimensions of the dataset? 
<br>
2) What kind of columns do you have?
<br>
3) What datatypes are these columns?
<br>
4) Are there lots of missing or Null values?
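
Each of the four questions above maps to a short pandas expression. The sketch below runs them on a small invented DataFrame called `sample`, so it works without the movie file; the same calls apply to `mov`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row dataset with one missing duration
sample = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'duration': [100.0, np.nan, 95.0],
    'score': [7.1, 6.3, 8.0],
})

print(sample.shape)            # 1) dimensions as (rows, columns)
print(list(sample.columns))    # 2) column names
print(sample.dtypes)           # 3) datatype of each column
missing = sample.isna().sum()  # 4) number of missing values per column
print(missing)
```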

In [None]:
# Getting just the column names out of a DataFrame

col_names = list(mov)
print(col_names)

In [None]:
# Checking datatypes for your columns

mov.dtypes

# The object dtype usually means the column holds text (strings)

In [27]:
# Pandas can summarise the numeric columns automatically for you, to give you a snapshot of the dataset

mov.describe()

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
count,4993.0,5028.0,4939.0,5020.0,5036.0,4159.0,5043.0,5043.0,5030.0,5022.0,4551.0,4935.0,5030.0,5043.0,4714.0,5043.0
mean,140.194272,107.201074,686.509212,645.009761,6560.047061,48468410.0,83668.16,9699.063851,1.371173,272.770808,39752620.0,2002.470517,1651.754473,6.442138,2.220403,7525.964505
std,121.601675,25.197441,2813.328607,1665.041728,15020.75912,68452990.0,138485.3,18163.799124,2.013576,377.982886,206114900.0,12.474599,4042.438863,1.125116,1.385113,19320.44511
min,1.0,7.0,0.0,0.0,0.0,162.0,5.0,0.0,0.0,1.0,218.0,1916.0,0.0,1.6,1.18,0.0
25%,50.0,93.0,7.0,133.0,614.0,5340988.0,8593.5,1411.0,0.0,65.0,6000000.0,1999.0,281.0,5.8,1.85,0.0
50%,110.0,103.0,49.0,371.5,988.0,25517500.0,34359.0,3090.0,1.0,156.0,20000000.0,2005.0,595.0,6.6,2.35,166.0
75%,195.0,118.0,194.5,636.0,11000.0,62309440.0,96309.0,13756.5,2.0,326.0,45000000.0,2011.0,918.0,7.2,2.35,3000.0
max,813.0,511.0,23000.0,23000.0,640000.0,760505800.0,1689764.0,656730.0,43.0,5060.0,12215500000.0,2016.0,137000.0,9.5,16.0,349000.0


In [31]:
# Get a snapshot of a single row by using the iloc indexer, which selects by position (here, position 124)

mov.iloc[124]

color                                                                    Color
director_name                                                   Lana Wachowski
num_critic_for_reviews                                                     245
duration                                                                   129
director_facebook_likes                                                      0
actor_3_facebook_likes                                                     233
actor_2_name                                                       Collin Chou
actor_1_facebook_likes                                                     309
gross                                                               1.3926e+08
genres                                                           Action|Sci-Fi
actor_1_name                                                       Essie Davis
movie_title                                            The Matrix Revolutions 
num_voted_users                                     

In [32]:
# This is really useful in getting comfortable with your dataset. 
# But another thing we might want to know is...how many unique values are there in a column?
# Example: how many different genres are in the dataset? 
# Answer: Numpy.unique

import numpy as np

# To take only a specific column out, we use the following notation df['column name']

uniqueGenres = np.unique(mov['genres'])
print(len(uniqueGenres))
print(uniqueGenres)

914
['Action' 'Action|Adventure'
 'Action|Adventure|Animation|Comedy|Crime|Family|Fantasy'
 'Action|Adventure|Animation|Comedy|Drama|Family|Fantasy|Thriller'
 'Action|Adventure|Animation|Comedy|Drama|Family|Sci-Fi'
 'Action|Adventure|Animation|Comedy|Family'
 'Action|Adventure|Animation|Comedy|Family|Fantasy'
 'Action|Adventure|Animation|Comedy|Family|Fantasy|Sci-Fi'
 'Action|Adventure|Animation|Comedy|Family|Sci-Fi'
 'Action|Adventure|Animation|Comedy|Fantasy'
 'Action|Adventure|Animation|Comedy|Fantasy|Sci-Fi'
 'Action|Adventure|Animation|Comedy|Sci-Fi'
 'Action|Adventure|Animation|Drama|Fantasy|Sci-Fi'
 'Action|Adventure|Animation|Drama|Mystery|Sci-Fi|Thriller'
 'Action|Adventure|Animation|Family'
 'Action|Adventure|Animation|Family|Fantasy'
 'Action|Adventure|Animation|Family|Fantasy|Sci-Fi'
 'Action|Adventure|Animation|Family|Sci-Fi'
 'Action|Adventure|Animation|Family|Sci-Fi|Thriller'
 'Action|Adventure|Animation|Fantasy'
 'Action|Adventure|Animation|Fantasy|Romance|Sci-Fi'
 'Act
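
Counting unique values of `genres` counts genre *combinations*, since each cell holds several genres joined by `|`. A possible way to count individual genres is to split the strings and flatten them. This sketch uses a made-up Series and assumes a pandas version that provides `Series.explode` (0.25 or later).

```python
import pandas as pd

# Hypothetical genres column, including one missing value
genres = pd.Series(['Action|Sci-Fi', 'Action|Adventure', 'Comedy', None])

# Distinct combinations; nunique() ignores missing values by default
n_combinations = genres.nunique()
print(n_combinations)

# Split each combination on '|' and put every genre on its own row
individual = genres.dropna().str.split('|').explode()
print(individual.nunique())
print(sorted(individual.unique()))
```

`Series.nunique()` is also a convenient alternative to `len(np.unique(...))`, because it skips missing values rather than tripping over them.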

## In-Class Activity #1: 

Find the number of distinct directors and actors (actor_1_name) in the above dataset.

In [49]:
# Your code below
mov = mov.dropna(how='any')  # drops every row with a missing value; mov is overwritten, so later cells see the reduced dataset
uniqueDirectorActor = np.unique(mov['actor_1_name'])
print(len(uniqueDirectorActor))

print(len(list(np.unique(mov['actor_1_name']))))
print(len(list((mov.actor_1_name.unique()))))

mov['movie_title'].head()


1428
1428
1428


0                                      Avatar 
1    Pirates of the Caribbean: At World's End 
2                                     Spectre 
3                       The Dark Knight Rises 
5                                 John Carter 
Name: movie_title, dtype: object

### Sorting by columns

Let's say you want to sort the dataset by IMDB score. Pandas makes this really easy to do.

In [53]:
mov.sort_values(by='imdb_score', ascending = False)['movie_title']

1937                            The Shawshank Redemption 
3466                                       The Godfather 
66                                       The Dark Knight 
2837                              The Godfather: Part II 
339        The Lord of the Rings: The Return of the King 
4498                      The Good, the Bad and the Ugly 
1874                                    Schindler's List 
3355                                        Pulp Fiction 
836                                         Forrest Gump 
270     The Lord of the Rings: The Fellowship of the R...
97                                             Inception 
2051      Star Wars: Episode V - The Empire Strikes Back 
683                                           Fight Club 
4029                                         City of God 
340                The Lord of the Rings: The Two Towers 
3024                  Star Wars: Episode IV - A New Hope 
3867                     One Flew Over the Cuckoo's Nest 
1903          

In [41]:
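# Careful: iloc selects by position, not by index label. Rows were dropped earlier, so
# position 1937 is no longer the row labelled 1937 in the sorted output; use mov.loc[1937] to look up by label
mov.iloc[1937]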

color                                                                    Color
director_name                                                    Paul McGuigan
num_critic_for_reviews                                                     159
duration                                                                   110
director_facebook_likes                                                    118
actor_3_facebook_likes                                                     287
actor_2_name                                                   Spencer Wilding
actor_1_facebook_likes                                                   11000
gross                                                              5.77352e+06
genres                                            Drama|Horror|Sci-Fi|Thriller
actor_1_name                                                  Daniel Radcliffe
movie_title                                               Victor Frankenstein 
num_voted_users                                     
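
The row shown above is not the top-rated film from the sorted output because `iloc` works by position while the numbers printed by `sort_values` are index labels. A minimal sketch of the difference, on a made-up `films` table:

```python
import pandas as pd

# Hypothetical data; the default labels are 0, 1, 2, 3
films = pd.DataFrame(
    {'title': ['A', 'B', 'C', 'D'], 'score': [6.0, 9.3, 7.5, 8.1]}
)
films = films.drop(index=[0])        # labels are now 1, 2, 3

by_label = films.loc[1, 'title']     # the row whose label is 1
by_position = films.iloc[1]['title'] # the second row, whatever its label
print(by_label, by_position)         # B C
```

Once any rows have been removed, labels and positions drift apart, so `loc` is the safer choice when following up on an index value you saw in an output.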

## In-Class Activity #2: Sort by budget, ascending


In [81]:
# Your code below
mov.sort_values(by='budget')[['director_name','actor_1_name']]

Unnamed: 0,director_name,actor_1_name
4799,Jonathan Caouette,Greg Ayres
5042,Jon Gunn,John August
5026,Olivier Assayas,Maggie Cheung
5035,Robert Rodriguez,Carlos Gallardo
5033,Shane Carruth,Shane Carruth
5025,John Waters,Divine
5027,Jafar Panahi,Fereshteh Sadre Orafaiy
4311,Hunter Richards,Jason Statham
4793,Oren Peli,Micah Sloat
5015,Richard Linklater,Tommy Pallotta


### Advanced Manipulation Techniques

Now that we've become familiar with the basic functions, let's start doing some fun stuff. We will continue using the IMDB dataset.

In [None]:
# Filtering by column values:
# Let's say you wanted to see only Steven Spielberg films
# How would you do that?

spielberg = mov[mov.director_name == 'Steven Spielberg']

# An alternative way to do this would be:
# spielberg = mov[mov['director_name'] == 'Steven Spielberg']
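
Filters like this are boolean masks, and masks can be combined. The sketch below uses an invented three-row table (the scores are not taken from the IMDB dataset) to show `&` for "and" plus `isin` for matching several values.

```python
import pandas as pd

# Hypothetical data for illustration only
films = pd.DataFrame({
    'director_name': ['Steven Spielberg', 'James Cameron', 'Steven Spielberg'],
    'imdb_score': [8.9, 7.9, 6.5],
})

is_spielberg = films['director_name'] == 'Steven Spielberg'
is_high = films['imdb_score'] >= 8

# Combine masks with & (and) or | (or)
top_spielberg = films[is_spielberg & is_high]
print(top_spielberg)

# isin() keeps rows whose value appears in the given list
either = films[films['director_name'].isin(['Steven Spielberg', 'James Cameron'])]
print(len(either))
```

When the comparisons are written inline, each one needs its own parentheses, e.g. `films[(a == 1) & (b > 2)]`, because `&` binds more tightly than `==`.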

In [None]:
spielberg.head()

In [None]:
# Let's say you want to convert the color column to numeric codes (1s & 0s)
# Why? Maybe you want to use it in your analysis somehow (example for later, Logistic Regression)
# Before doing that, let's make sure there are only two unique values in the column

colorUq = np.unique(mov['color'])

In [69]:
# Well well well... if you google the error, you'll find that it is caused by null values
# Before we change the underlying data, let's first make a copy of the original dataset
mov2 = mov.copy()

# Handling null values:
# dropna(how='any') removes every row that has a null in any column
# (to drop only the rows where color is null, use dropna(subset=['color']) instead)

mov2 = mov2.dropna(how='any')

# colorUq = np.unique(mov2['color'])
mov2.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
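
How many rows `dropna` removes depends on its arguments. A small sketch on made-up data, comparing `how='any'` with the `subset` argument:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row is missing color, another is missing budget
films = pd.DataFrame({
    'color': ['Color', np.nan, 'Color'],
    'budget': [1.0, 2.0, np.nan],
})

any_null = films.dropna(how='any')          # drop rows with a null anywhere
color_only = films.dropna(subset=['color']) # drop rows only if color is null

print(len(any_null))     # 1
print(len(color_only))   # 2
```

`how='any'` is the stricter option, so it is worth checking `shape` before and after to see how much data it costs you.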


In [70]:
colorUq = np.unique(mov2['color'])
print(len(colorUq)) # There you go, now we only have two values: B/W (note the leading space in the raw data) and Color.
print(colorUq)

2
[' Black and White' 'Color']


In [71]:
# We still can't use this column directly in most analyses. What if we wanted to change it to numeric codes?
# i.e. Color = 1 and B/W = 0

colorDict = {'Color':1,' Black and White':0}
mov2['color'] = mov2['color'].map(colorDict) # the map function maps the dictionary to the entire column
mov2.head()



Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,1,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,1,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,1,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,1,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
5,1,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000


In [72]:
print(np.unique(mov2['color'])) # All working now 


[0 1]
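
One thing to watch with a dictionary `map`: any value that is not a key in the dictionary becomes NaN. The sketch below shows this with a deliberately misspelled entry (the Series `color` and its values are made up).

```python
import pandas as pd

# Hypothetical column; 'Colour' is not a key in the mapping
color = pd.Series(['Color', ' Black and White', 'Colour'])
mapping = {'Color': 1, ' Black and White': 0}

encoded = color.map(mapping)
print(encoded)

unmapped = int(encoded.isna().sum())
print(unmapped)   # 1 -> one value had no matching key
```

Checking for NaN after a `map` is a cheap way to catch typos or stray whitespace in the source values.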


In [74]:
map?

In [78]:
# A note on map: Python's built-in map works on any iterable (such as a list) with the structure map(func, iterable)

def square(x):
    return x**2

y = [1,2,3]
z = map(square,y)
print(type(z))
print(list(z))

<class 'map'>
[1, 4, 9]


In [None]:
# Lambda functions are really useful.
# They are used when you only want to do something once, and don't want to define a named function for it
# In Pandas, they are often used to apply a transformation to every value in a particular column

# For example, let's say you wanted to get the duration in seconds and not minutes for the movie dataset
# You can pass a lambda function to map, which applies that transformation to every value in the column
# Let's see how to use it:

mov2['duration'] = mov2['duration'].map(lambda x: x * 60)
mov2.head()
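
For simple arithmetic like this, pandas also lets you operate on the whole column at once, which gives the same result as the `map`/`lambda` version. A sketch with three made-up durations:

```python
import pandas as pd

duration = pd.Series([178.0, 169.0, 148.0])   # hypothetical minutes

via_map = duration.map(lambda x: x * 60)      # element by element
vectorised = duration * 60                    # whole column at once

same = via_map.equals(vectorised)
print(same)   # True
```

The column-wise form is usually shorter and faster; `map` with a lambda remains handy when the transformation cannot be written as plain arithmetic.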

In [None]:
# What if you wanted to keep the original column untouched? 
# Let's convert back to minutes now
# We can create a new column like this and apply the inverse function

mov2['duration_mins'] = mov2['duration']/60
mov2.head()

### Calculating Metrics

We can do some more exploration of the dataset yet. 
<br>
What if we wanted to find max/min/mean etc of columns? 
<br>
How can we group by columns?


In [None]:
# Calculating maximums and minimums

mov2['budget'].max()

In [79]:
# Wait, this sounds really wrong.
# 12 billion USD budget? Let's investigate more
# We can filter for the row whose budget equals the maximum, then use iloc to pull out that row

row = mov2[mov2['budget']==mov2['budget'].max()]
row.iloc[0]

color                                                                        1
director_name                                                     Joon-ho Bong
num_critic_for_reviews                                                     363
duration                                                                   110
director_facebook_likes                                                    584
actor_3_facebook_likes                                                      74
actor_2_name                                                      Kang-ho Song
actor_1_facebook_likes                                                     629
gross                                                              2.20141e+06
genres                                              Comedy|Drama|Horror|Sci-Fi
actor_1_name                                                         Doona Bae
movie_title                                                          The Host 
num_voted_users                                     
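
Another way to reach the same row is `idxmax`, which returns the index label of the largest value, and `nlargest`, which returns the top few rows. The sketch below uses a hypothetical three-row table with arbitrary index labels.

```python
import pandas as pd

# Hypothetical data with non-consecutive index labels
films = pd.DataFrame(
    {'movie_title': ['A', 'B', 'C'], 'budget': [5e6, 1.2e10, 3e7]},
    index=[10, 20, 30],
)

top_label = films['budget'].idxmax()   # label of the row with the largest budget
print(films.loc[top_label])

top_two = films.nlargest(2, 'budget')  # the two largest budgets, biggest first
print(top_two['movie_title'].tolist())
```

Looking at the top few rows rather than just the maximum helps to judge whether an extreme value is an outlier or a data problem such as a different currency.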

In [None]:
# Anyways, let's move on

mov2['budget'].min()

In [None]:
mov2[mov2['budget']==mov2['budget'].min()].iloc[0]

In [None]:
# Calculating the mean for budget
mov2['budget'].mean()

In [86]:
# Lastly, let's see if we can use Pandas to get the average budget and gross by director and lead actor
roi = mov2[['director_name','actor_1_name','gross','budget']]
roi = roi.groupby(['director_name','actor_1_name']).mean()

roi.sort_values(by='gross')   # returns a sorted copy; roi itself is unchanged because the result isn't assigned

roi = roi.reset_index()   # turn the group keys back into regular columns so we can sort by director name
roi.sort_values(by='director_name')

Unnamed: 0,director_name,actor_1_name,gross,budget
0,Aaron Schneider,Bill Murray,9176553.0,7500000.0
1,Aaron Seltzer,Alyson Hannigan,48546578.0,20000000.0
2,Abel Ferrara,Isabella Rossellini,1227324.0,12500000.0
3,Adam Goldberg,Judy Greer,2580.0,1650000.0
4,Adam Marcus,Kane Hodder,15935068.0,2500000.0
5,Adam McKay,Darcy Donavan,84136909.0,26000000.0
6,Adam McKay,Dwayne Johnson,119219978.0,100000000.0
7,Adam McKay,Harrison Ford,2175312.0,50000000.0
8,Adam McKay,Ryan Gosling,70235322.0,28000000.0
9,Adam McKay,Will Ferrell,124341085.0,69000000.0


In [None]:
roi.sort_values(by='gross',ascending = False)
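
Since the table is called `roi`, a natural next step is to add a ratio column after grouping. This is only a sketch on invented numbers, and the column name `gross_to_budget` is a made-up label rather than anything from the dataset.

```python
import pandas as pd

# Hypothetical data: two films for director X, one for director Y
films = pd.DataFrame({
    'director_name': ['X', 'X', 'Y'],
    'gross': [100.0, 300.0, 50.0],
    'budget': [50.0, 150.0, 100.0],
})

# Average gross and budget per director, with the key kept as a column
summary = (
    films.groupby('director_name')[['gross', 'budget']]
    .mean()
    .reset_index()
)

# Ratio of average gross to average budget
summary['gross_to_budget'] = summary['gross'] / summary['budget']
print(summary.sort_values(by='gross_to_budget', ascending=False))
```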

## Taking things out of a DataFrame

Now that we have covered some basic data exploration, it's time to see how we can export data from a Pandas DataFrame.

### Column --> List

Let's say you just want to take a column out into a list.

In [None]:
directors = mov2['director_name'].tolist()

In [None]:
print(directors)

In [87]:
# A second, more customised example
# Say you wanted the imdb scores of only James Cameron's films, along with the film names

cameron = []

cameron.append(mov2.loc[mov2['director_name'] == 'James Cameron','movie_title'].tolist())
cameron.append(mov2.loc[mov2['director_name']== 'James Cameron','imdb_score'].tolist())

print(cameron)

[['Avatar\xa0', 'Titanic\xa0', 'Terminator 2: Judgment Day\xa0', 'True Lies\xa0', 'The Abyss\xa0', 'Aliens\xa0', 'The Terminator\xa0'], [7.9, 7.7, 8.5, 7.2, 7.6, 8.4, 8.1]]
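
The two parallel lists are easier to use once they are paired up, and the trailing `\xa0` (a non-breaking space) on each title can be stripped at the same time. A sketch using the first two titles and scores from the output above, with the hypothetical name `cameron_scores`:

```python
# First two titles and scores copied from the printed output
titles = ['Avatar\xa0', 'Titanic\xa0']
scores = [7.9, 7.7]

# zip pairs each title with its score; strip() removes the trailing \xa0
cameron_scores = {title.strip(): score for title, score in zip(titles, scores)}
print(cameron_scores)   # {'Avatar': 7.9, 'Titanic': 7.7}
```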


### DataFrame to CSV

This is also a very useful function. Let's say you imported a dataset into Python, cleaned it and did some blending, slicing and manipulation. Now you want to take it back out. Pandas has built-in functions for doing that.

**General Syntax Below:**

df.to_csv("{your file path here}", sep = "{what separator you want, e.g. comma}", index = {True or False: whether you want to keep the index})

In [None]:
# Let's try it:
mov2.to_csv('movies.csv' , sep=',' , index = False)
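
To see what the `index` argument changes, the sketch below writes a tiny made-up DataFrame to a temporary folder twice and reads both files back. The file names are placeholders.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})   # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    without_index = os.path.join(tmp, 'no_index.csv')
    with_index = os.path.join(tmp, 'with_index.csv')

    df.to_csv(without_index, sep=',', index=False)
    df.to_csv(with_index, sep=',', index=True)

    cols_without = list(pd.read_csv(without_index).columns)
    cols_with = list(pd.read_csv(with_index).columns)

print(cols_without)   # ['A', 'B']
print(cols_with)      # ['Unnamed: 0', 'A', 'B']
```

With `index=True` the row labels are written as an extra column with no header, which pandas names `Unnamed: 0` when the file is read back; `index=False` avoids that.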

## In-Class Activity #3

What genre combinations make the most money? See if you can group by genre on gross and sort in descending order to find out what the top 3 grossing genre combinations are.

In [98]:
# Reading a new & fresh dataset
mov = pd.read_csv('movie_metadata.csv')

# YOUR CODE BELOW
genreGross = mov[['genres','gross']]
genreGross = genreGross.groupby(['genres']).sum()
genreGross = genreGross.sort_values(by='gross', ascending=False)

genreGross.head(3)





genres,gross
Action|Adventure|Sci-Fi,9290409000.0
Comedy,6734155000.0
Comedy|Romance,6371779000.0


## In-Class Activity #4

Let's dive in deeper now. Are the 2nd actors (actor_2_name) who were part of the top 10 highest-grossing movies also among the top 10 2nd actors with the most Facebook likes?

In [107]:
mov = pd.read_csv('movie_metadata.csv')

# Your code below
t = mov[['actor_2_name', 'gross', 'actor_2_facebook_likes']]
t = t.sort_values(by='gross', ascending=False)
top10Gross = list(t.head(10)['actor_2_name'])

t = t.sort_values(by='actor_2_facebook_likes', ascending=False)
top10FacebookLikes = list(t.head(10)['actor_2_name'])

print(top10Gross)
print(top10FacebookLikes)
print(list(set(top10Gross) & set(top10FacebookLikes)))


['Joel David Moore', 'Kate Winslet', 'Judy Greer', 'Robert Downey Jr.', 'Robert Downey Jr.', 'Heath Ledger', 'Liam Neeson', 'Peter Cushing', 'Robert Downey Jr.', 'Christian Bale']
['Andrew Fiscella', 'Leonardo DiCaprio', 'Tom Hardy', 'Tom Hardy', 'Alan Rickman', 'Alan Rickman', 'Alan Rickman', 'Christian Bale', 'Christian Bale', 'Paul Walker']
['Christian Bale']



## In-Class Activity #5 (Bonus)

Find the average imdb_score by director, take it out into a list, and write it to a dictionary (director_name being the key and average rating being the value).

<br>
<br>
This problem requires using a mix of Pandas and the Python Standard Library.

In [114]:
mov = pd.read_csv('movie_metadata.csv')

# Your code below
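imdb = mov[['director_name','imdb_score']]
imdb = imdb.groupby(['director_name']).mean()

# list(imdb) would only return the column names (['imdb_score']),
# so pull the values out of the column instead
imdbList = imdb['imdb_score'].tolist()

# director_name is now the index, so pair it with the averages to build the dictionary
imdbDict = dict(zip(imdb.index, imdbList))

print(imdbList[:5])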


## In-Class Activity #6

Now let's use a new dataset: Human Resource Analytics. There are no exact questions I want to put forward here; rather, it's an opportunity for you guys to follow some open-ended directions and take this wherever you want.

Some things to explore:

1) Are there any general patterns in people that are leaving that are very noticeable? 
<br>
2) Are there any general patterns in people who are staying that are very noticeable?
<br>
3) Do you see any effects on salary or on employee's evaluation scores of how much an employee works?
<br>
4) How would you try to predict if an employee will leave? What columns/combination of columns would you use?