# Preamble

Theodore Petrou. “Pandas Cookbook: Recipes for Scientific Computing, Time Series Analysis and Data Visualization using Python.” Apple Books.

Good references: 

https://towardsdatascience.com/bringing-the-best-out-of-jupyter-notebooks-for-data-science-f0871519ca29

https://towardsdatascience.com/interactive-controls-for-jupyter-notebooks-f5c94829aee6

One key shortcut --> Switching between ```code``` and ```markdown``` in a cell

To get to markdown from a code cell, key sequence is ```Esc```, ```M```, ```Enter```

Substitute Y for M to switch back to code

In [1]:
#Preamble
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Chapter 1: Foundations
## Covering basics of Pandas from data types to common operations in DataFrame Environment

In [2]:
movie = pd.read_csv('data/movie.csv')
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [3]:
#Assign the index, columns, and values objects of the DataFrame to individual variables, then view them
#and their data types
index = movie.index
columns = movie.columns
data = movie.values

In [4]:
index

RangeIndex(start=0, stop=4916, step=1)

In [5]:
columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

In [6]:
data

array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
       ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
       ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
       ...,
       ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
       ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
       ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)

In [7]:
type(data)

numpy.ndarray

In [8]:
type(index)

pandas.core.indexes.range.RangeIndex

In [9]:
type(columns)

pandas.core.indexes.base.Index

In [10]:
#The index and columns objects are related types: RangeIndex is a subclass of Index
issubclass(pd.RangeIndex, pd.Index)

True

In [11]:
#Demonstrating that NumPy ndarrays underlie the Index here (and, by default, most Pandas objects)
index.values

array([   0,    1,    2, ..., 4913, 4914, 4915])

In [12]:
#Check each column for its data type
movie.dtypes

color                         object
director_name                 object
num_critic_for_reviews       float64
duration                     float64
director_facebook_likes      float64
actor_3_facebook_likes       float64
actor_2_name                  object
actor_1_facebook_likes       float64
gross                        float64
genres                        object
actor_1_name                  object
movie_title                   object
num_voted_users                int64
cast_total_facebook_likes      int64
actor_3_name                  object
facenumber_in_poster         float64
plot_keywords                 object
movie_imdb_link               object
num_user_for_reviews         float64
language                      object
country                       object
content_rating                object
budget                       float64
title_year                   float64
actor_2_facebook_likes       float64
imdb_score                   float64
aspect_ratio                 float64
m

In [13]:
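#Note: get_dtype_counts() was deprecated in pandas 0.25 and removed in 1.0;
#on newer versions use movie.dtypes.value_counts() instead
movie.get_dtype_counts()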

float64    13
int64       3
object     12
dtype: int64
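
If `get_dtype_counts()` is unavailable in your pandas version, the same tally can be obtained from `.dtypes`. A minimal sketch on a small made-up DataFrame (the column names and values are assumptions for illustration, not the movie data):

```python
import pandas as pd

# Small illustrative frame; names and values are made up for this example
df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'score': [7.9, 7.1, 6.8],
    'votes': [100, 200, 300],
    'duration': [178.0, 169.0, None],
})

# Count how many columns share each dtype.
# Casting the dtypes to str makes the result easy to look up by dtype name.
dtype_counts = df.dtypes.astype(str).value_counts()
print(dtype_counts)
```

The order of the rows follows the counts (largest first), so it may differ from the alphabetical order printed by the older method.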

In [14]:
#Extracting Series from DataFrame --> 1) Index Method
a = movie['director_name']
print(a)
type(a)

0            James Cameron
1           Gore Verbinski
2               Sam Mendes
3        Christopher Nolan
4              Doug Walker
5           Andrew Stanton
6                Sam Raimi
7             Nathan Greno
8              Joss Whedon
9              David Yates
10             Zack Snyder
11            Bryan Singer
12            Marc Forster
13          Gore Verbinski
14          Gore Verbinski
15             Zack Snyder
16          Andrew Adamson
17             Joss Whedon
18            Rob Marshall
19        Barry Sonnenfeld
20           Peter Jackson
21               Marc Webb
22            Ridley Scott
23           Peter Jackson
24             Chris Weitz
25           Peter Jackson
26           James Cameron
27           Anthony Russo
28              Peter Berg
29         Colin Trevorrow
               ...        
4886            Eric Eason
4887              Uwe Boll
4888     Richard Linklater
4889       Joseph Mazzella
4890          Travis Legge
4891         Alex Kendrick
4

pandas.core.series.Series

In [15]:
#Extracting Series From a DataFrame --> 2) Dot Notation
#Note: Dot notation will fail if column name contains special characters --> Index operator preferable for me
a = movie.director_name
print(a)
type(a)

0            James Cameron
1           Gore Verbinski
2               Sam Mendes
3        Christopher Nolan
4              Doug Walker
5           Andrew Stanton
6                Sam Raimi
7             Nathan Greno
8              Joss Whedon
9              David Yates
10             Zack Snyder
11            Bryan Singer
12            Marc Forster
13          Gore Verbinski
14          Gore Verbinski
15             Zack Snyder
16          Andrew Adamson
17             Joss Whedon
18            Rob Marshall
19        Barry Sonnenfeld
20           Peter Jackson
21               Marc Webb
22            Ridley Scott
23           Peter Jackson
24             Chris Weitz
25           Peter Jackson
26           James Cameron
27           Anthony Russo
28              Peter Berg
29         Colin Trevorrow
               ...        
4886            Eric Eason
4887              Uwe Boll
4888     Richard Linklater
4889       Joseph Mazzella
4890          Travis Legge
4891         Alex Kendrick
4

pandas.core.series.Series

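The previous block [15] highlights a key limitation of dot notation in Pandas: it won't work if a column name is not a valid Python identifier (for example, if it contains spaces or special characters), or if the name collides with an existing DataFrame attribute or method.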

**For my arc data, I'll need to rename all columns, or I'll need to use index notation only**
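
A minimal sketch of both options, using a made-up DataFrame (the column name `arc length (mm)` is hypothetical, chosen only to show the problem). The cleaning rule at the end is one possible convention, not the only one:

```python
import pandas as pd

# Hypothetical column labels: one with spaces and punctuation, one clean
df = pd.DataFrame({'arc length (mm)': [1.5, 2.0],
                   'director_name': ['A', 'B']})

# Option 1: index (bracket) notation works for any column label
lengths = df['arc length (mm)']

# Dot notation needs a label that is a valid Python identifier
messy_ok = 'arc length (mm)'.isidentifier()   # False
clean_ok = 'director_name'.isidentifier()     # True

# Option 2: rename every column so that dot notation becomes possible
cleaned = df.copy()
cleaned.columns = (cleaned.columns
                   .str.strip()
                   .str.lower()
                   .str.replace(r'[^0-9a-z]+', '_', regex=True)
                   .str.strip('_'))
print(cleaned.columns.tolist())
```

After the rename, `cleaned.arc_length_mm` and `cleaned['arc_length_mm']` return the same Series.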

In [16]:
#Converting a Series to a single-column DataFrame is easy!
director_df = a.to_frame()
print(director_df)

           director_name
0          James Cameron
1         Gore Verbinski
2             Sam Mendes
3      Christopher Nolan
4            Doug Walker
5         Andrew Stanton
6              Sam Raimi
7           Nathan Greno
8            Joss Whedon
9            David Yates
10           Zack Snyder
11          Bryan Singer
12          Marc Forster
13        Gore Verbinski
14        Gore Verbinski
15           Zack Snyder
16        Andrew Adamson
17           Joss Whedon
18          Rob Marshall
19      Barry Sonnenfeld
20         Peter Jackson
21             Marc Webb
22          Ridley Scott
23         Peter Jackson
24           Chris Weitz
25         Peter Jackson
26         James Cameron
27         Anthony Russo
28            Peter Berg
29       Colin Trevorrow
...                  ...
4886          Eric Eason
4887            Uwe Boll
4888   Richard Linklater
4889     Joseph Mazzella
4890        Travis Legge
4891       Alex Kendrick
4892       Marcus Nispel
4893     Brandon Landers


Over 450 operations are possible on Series and DataFrame objects

Going to walk through some common Series methods, many of which have equivalents for DataFrame objects

In [17]:
#Create series with different data types
director = movie['director_name']
actor_1_fb_likes = movie['actor_1_facebook_likes']

In [18]:
#Inspect .head() of each series
director.head()

0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4          Doug Walker
Name: director_name, dtype: object

In [19]:
actor_1_fb_likes.head()

0     1000.0
1    40000.0
2    11000.0
3    27000.0
4      131.0
Name: actor_1_facebook_likes, dtype: float64

In [20]:
#One of the most useful operations for object dtypes: .value_counts(), which gives you the number of occurrences
#of each unique value

director.value_counts()

Steven Spielberg             26
Woody Allen                  22
Martin Scorsese              20
Clint Eastwood               20
Ridley Scott                 16
Spike Lee                    16
Steven Soderbergh            15
Renny Harlin                 15
Tim Burton                   14
Oliver Stone                 14
Barry Levinson               13
Robert Zemeckis              13
Ron Howard                   13
Robert Rodriguez             13
Joel Schumacher              13
Brian De Palma               12
Kevin Smith                  12
Tony Scott                   12
Michael Bay                  12
Sam Raimi                    11
Richard Linklater            11
Chris Columbus               11
Francis Ford Coppola         11
Shawn Levy                   11
Richard Donner               11
Rob Reiner                   11
David Fincher                10
Stephen Frears               10
Paul W.S. Anderson           10
John McTiernan               10
                             ..
Bruce Ca

In [21]:
#For numeric data, .value_counts() counts each distinct value; it does not bin. The round thousands in the output
#suggest the larger like counts were already rounded in the source data. Results are sorted by count (descending),
#not by value; chain .sort_index() to order by value instead
actor_1_fb_likes.value_counts()

1000.0     436
11000.0    206
2000.0     189
3000.0     150
12000.0    131
13000.0    123
14000.0    120
10000.0    109
18000.0    106
22000.0     80
15000.0     71
23000.0     55
16000.0     55
4000.0      54
8000.0      51
17000.0     45
26000.0     39
20000.0     38
40000.0     36
21000.0     34
19000.0     31
5000.0      30
24000.0     29
49000.0     27
0.0         26
29000.0     20
6000.0      20
33000.0     18
826.0       17
34000.0     16
          ... 
458.0        1
77000.0      1
763.0        1
961.0        1
701.0        1
123.0        1
575.0        1
481.0        1
107.0        1
279.0        1
188.0        1
619.0        1
652.0        1
237.0        1
764.0        1
335.0        1
494.0        1
732.0        1
712.0        1
91.0         1
437.0        1
406.0        1
762.0        1
432.0        1
644.0        1
362.0        1
216.0        1
859.0        1
225.0        1
334.0        1
Name: actor_1_facebook_likes, Length: 877, dtype: int64
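
A short sketch of the ordering options for `.value_counts()`, on a small made-up Series (the values are illustrative, not taken from the movie data):

```python
import pandas as pd

# Made-up like counts for illustration
likes = pd.Series([1000.0, 1000.0, 131.0, 40000.0, 131.0, 1000.0])

# Default: sorted by frequency, most common value first
by_count = likes.value_counts()

# Sorted by the value itself (ascending) rather than by frequency
by_value = likes.value_counts().sort_index()

# If actual binning is wanted, the bins parameter groups values into ranges
binned = likes.value_counts(bins=3)

print(by_count)
print(by_value)
print(binned)
```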

You can quantify the number of elements in a Series in one of three ways (two attributes and one built-in function). Each has its respective uses: 

1. .size

2. .shape

3. len()

In [22]:
director.size

4916

In [23]:
director.shape

(4916,)

In [24]:
len(director)

4916

In [25]:
len(actor_1_fb_likes)

4916

*.count()* is used in other circumstances. It returns the number of non-missing values. Can be useful for screening for number of NaN values

In [26]:
director.count()

4814

In [27]:
actor_1_fb_likes.count()

4909
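
To see the difference between the element counts and `.count()` side by side, here is a minimal sketch on a made-up Series with two missing entries:

```python
import numpy as np
import pandas as pd

# Made-up values; NaN and None are both treated as missing
s = pd.Series(['a', np.nan, 'b', None])

n_size = s.size                   # total number of elements, missing included
n_len = len(s)                    # same total, via the Python built-in
shape = s.shape                   # one-element tuple holding the same total
n_present = s.count()             # non-missing values only
n_missing = s.size - s.count()    # one way to get the number of missing values

print(n_size, n_len, shape, n_present, n_missing)
```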

**Basic summary statistics** can be applied to numeric data types. For instance:

1. .sum()
2. .min()
3. .max()
4. .std()
5. .mean()
6. .median()

In [28]:
#Presenting summary statistics of actor_1_fb_likes as a tuple
actor_1_fb_likes.min(), actor_1_fb_likes.max(), actor_1_fb_likes.mean(), actor_1_fb_likes.median(), \
actor_1_fb_likes.std(), actor_1_fb_likes.sum()

(0.0, 640000.0, 6494.488490527602, 982.0, 15106.986883848185, 31881444.0)

.describe() method outputs all relevant summary statistics in one go

In [29]:
actor_1_fb_likes.describe()

count      4909.000000
mean       6494.488491
std       15106.986884
min           0.000000
25%         607.000000
50%         982.000000
75%       11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64

In [30]:
director.describe()

count                 4814
unique                2397
top       Steven Spielberg
freq                    26
Name: director_name, dtype: object

.quantile() method exists to provide precise numerical quantiles of data

If you want multiple quantiles (0.1 --> 0.9), you need to put quantiles in a list, and then pass into .quantile()
method as an argument

In [31]:
actor_1_fb_likes.quantile([0.1, 0.2, 0.3, 0.4, 0.5, 
                           0.6, 0.7, 0.8, 0.9])

0.1      240.0
0.2      510.0
0.3      694.0
0.4      854.0
0.5      982.0
0.6     1000.0
0.7     8000.0
0.8    13000.0
0.9    18000.0
Name: actor_1_facebook_likes, dtype: float64

We know from our *.count()* method that our Series contain NaN values. 

We can use the *.isnull()* method to return a boolean series of the same length as director or actor with True/False values corresponding to whether each value is NaN or not

In [32]:
director.isnull()

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
4886    False
4887    False
4888    False
4889    False
4890    False
4891    False
4892    False
4893    False
4894    False
4895    False
4896    False
4897    False
4898    False
4899    False
4900    False
4901    False
4902    False
4903    False
4904    False
4905    False
4906    False
4907    False
4908    False
4909    False
4910    False
4911    False
4912     True
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool

Calling the *.fillna(x)* method on a Series with NaN values returns a new Series with each NaN value replaced by **x**

In [33]:
actor_1_fb_likes_filled = actor_1_fb_likes.fillna(0)
actor_1_fb_likes_filled.count()

4916

Can remove all entries with NaN with *.dropna()*

In [34]:
actor_1_fb_likes_dropped = actor_1_fb_likes.dropna()
actor_1_fb_likes_dropped.size

4909

We can get relative frequencies instead of raw counts from .value_counts() if we set .value_counts(normalize=True)

In [35]:
director.value_counts(normalize = True)

Steven Spielberg             0.005401
Woody Allen                  0.004570
Martin Scorsese              0.004155
Clint Eastwood               0.004155
Ridley Scott                 0.003324
Spike Lee                    0.003324
Steven Soderbergh            0.003116
Renny Harlin                 0.003116
Tim Burton                   0.002908
Oliver Stone                 0.002908
Barry Levinson               0.002700
Robert Zemeckis              0.002700
Ron Howard                   0.002700
Robert Rodriguez             0.002700
Joel Schumacher              0.002700
Brian De Palma               0.002493
Kevin Smith                  0.002493
Tony Scott                   0.002493
Michael Bay                  0.002493
Sam Raimi                    0.002285
Richard Linklater            0.002285
Chris Columbus               0.002285
Francis Ford Coppola         0.002285
Shawn Levy                   0.002285
Richard Donner               0.002285
Rob Reiner                   0.002285
David Finche

Faster way to assess whether a Series has NaN values: the *.hasnans* attribute (no parentheses):

In [36]:
director.hasnans

True
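
As a quick check that the attribute agrees with the longer chained form, a sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Made-up Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

quick_check = s.hasnans            # attribute access, no call
chained_check = s.isnull().any()   # same answer via a method chain

print(quick_check, chained_check)
```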

Now going to perform operations on Series, using common Python operators

Note that every time you restart a Jupyter kernel, you need to re-run the notebook's cells to reload the variables

In [37]:
imdb_score = movie['imdb_score']
imdb_score

0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
5       6.6
6       6.2
7       7.8
8       7.5
9       7.5
10      6.9
11      6.1
12      6.7
13      7.3
14      6.5
15      7.2
16      6.6
17      8.1
18      6.7
19      6.8
20      7.5
21      7.0
22      6.7
23      7.9
24      6.1
25      7.2
26      7.7
27      8.2
28      5.9
29      7.0
       ... 
4886    7.0
4887    6.3
4888    7.1
4889    4.8
4890    3.3
4891    6.9
4892    4.6
4893    3.0
4894    6.6
4895    7.4
4896    6.2
4897    4.0
4898    6.1
4899    6.9
4900    7.5
4901    6.7
4902    7.4
4903    6.1
4904    5.4
4905    6.4
4906    7.0
4907    6.3
4908    6.9
4909    7.8
4910    6.4
4911    7.7
4912    7.5
4913    6.3
4914    6.3
4915    6.6
Name: imdb_score, Length: 4916, dtype: float64

In [38]:
#Add one point to each score
imdb_score + 1

0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
5       7.6
6       7.2
7       8.8
8       8.5
9       8.5
10      7.9
11      7.1
12      7.7
13      8.3
14      7.5
15      8.2
16      7.6
17      9.1
18      7.7
19      7.8
20      8.5
21      8.0
22      7.7
23      8.9
24      7.1
25      8.2
26      8.7
27      9.2
28      6.9
29      8.0
       ... 
4886    8.0
4887    7.3
4888    8.1
4889    5.8
4890    4.3
4891    7.9
4892    5.6
4893    4.0
4894    7.6
4895    8.4
4896    7.2
4897    5.0
4898    7.1
4899    7.9
4900    8.5
4901    7.7
4902    8.4
4903    7.1
4904    6.4
4905    7.4
4906    8.0
4907    7.3
4908    7.9
4909    8.8
4910    7.4
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64

In [39]:
#Any scalar operator works on numeric series data
imdb_score * 2.5

0       19.75
1       17.75
2       17.00
3       21.25
4       17.75
5       16.50
6       15.50
7       19.50
8       18.75
9       18.75
10      17.25
11      15.25
12      16.75
13      18.25
14      16.25
15      18.00
16      16.50
17      20.25
18      16.75
19      17.00
20      18.75
21      17.50
22      16.75
23      19.75
24      15.25
25      18.00
26      19.25
27      20.50
28      14.75
29      17.50
        ...  
4886    17.50
4887    15.75
4888    17.75
4889    12.00
4890     8.25
4891    17.25
4892    11.50
4893     7.50
4894    16.50
4895    18.50
4896    15.50
4897    10.00
4898    15.25
4899    17.25
4900    18.75
4901    16.75
4902    18.50
4903    15.25
4904    13.50
4905    16.00
4906    17.50
4907    15.75
4908    17.25
4909    19.50
4910    16.00
4911    19.25
4912    18.75
4913    15.75
4914    15.75
4915    16.50
Name: imdb_score, Length: 4916, dtype: float64

In [40]:
# Can also apply comparison operators, which return a boolean Series
imdb_score > 7

0        True
1        True
2       False
3        True
4        True
5       False
6       False
7        True
8        True
9        True
10      False
11      False
12      False
13       True
14      False
15       True
16      False
17       True
18      False
19      False
20       True
21      False
22      False
23       True
24      False
25       True
26       True
27       True
28      False
29      False
        ...  
4886    False
4887    False
4888     True
4889    False
4890    False
4891    False
4892    False
4893    False
4894    False
4895     True
4896    False
4897    False
4898    False
4899    False
4900     True
4901    False
4902     True
4903    False
4904    False
4905    False
4906    False
4907    False
4908    False
4909     True
4910    False
4911     True
4912     True
4913    False
4914    False
4915    False
Name: imdb_score, Length: 4916, dtype: bool

In [41]:
director == 'James Cameron'

0        True
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26       True
27      False
28      False
29      False
        ...  
4886    False
4887    False
4888    False
4889    False
4890    False
4891    False
4892    False
4893    False
4894    False
4895    False
4896    False
4897    False
4898    False
4899    False
4900    False
4901    False
4902    False
4903    False
4904    False
4905    False
4906    False
4907    False
4908    False
4909    False
4910    False
4911    False
4912    False
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool

One of the great features of Pandas: each operation we just performed is applied to every element in the Series at once (vectorization). In native Python, such an operation would require a for loop or a comprehension. This added functionality is thanks to the NumPy library. 

There are dot notation method equivalents for each of these simple operators. 

e.g. + == .add()
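
A short sketch of a few operator/method pairs on a made-up Series. One practical reason to prefer the method form is that it accepts extra arguments such as `fill_value`:

```python
import numpy as np
import pandas as pd

# Made-up scores with one missing value
scores = pd.Series([7.9, np.nan, 6.8])

plus_one_op = scores + 1            # operator form
plus_one_method = scores.add(1)     # equivalent method form

# The method form can substitute a value for NaN before adding
filled_sum = scores.add(1, fill_value=0)

above_seven = scores.gt(7)          # same as scores > 7
doubled = scores.mul(2)             # same as scores * 2

print(plus_one_method)
print(filled_sum)
print(above_seven)
```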

Next, going to demonstrate method chaining

In [42]:
# Append .head() to the end of a method to suppress the long output
director.value_counts().head(3)

Steven Spielberg    26
Woody Allen         22
Martin Scorsese     20
Name: director_name, dtype: int64

**Below is a common method for counting number of missing values**

In [43]:
actor_1_fb_likes.isnull().sum()

7

In [44]:
actor_1_fb_likes.dtype

dtype('float64')

In [45]:
actor_1_fb_likes.head()

0     1000.0
1    40000.0
2    11000.0
3    27000.0
4      131.0
Name: actor_1_facebook_likes, dtype: float64

It is impossible to have a 'partial' Facebook like, but by default a numeric Series with missing values is stored as float, because NaN is itself a float (see dtype output above). So let's convert the actor_1_fb_likes Series to integer by chaining

In [46]:
actor_1_fb_likes.fillna(0)\
                .astype(int)\
                .head()

0     1000
1    40000
2    11000
3    27000
4      131
Name: actor_1_facebook_likes, dtype: int64
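
Filling with 0 makes a missing value indistinguishable from a real zero. As an alternative sketch, newer pandas versions offer a nullable integer dtype, `'Int64'` (capital I), which keeps the missing marker. The Series below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up like counts with one missing value
likes = pd.Series([1000.0, np.nan, 131.0])

# Approach used above: replace missing values, then cast to a plain integer
as_int = likes.fillna(0).astype(int)

# Alternative: nullable integer dtype, missing values stay missing
as_nullable = likes.astype('Int64')

print(as_int)
print(as_nullable)
```

Which one fits depends on whether a 0 is an acceptable stand-in for "unknown" in later calculations.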

In [47]:
# Another way to view number of missing values --> percentage
actor_1_fb_likes.isnull().mean() * 100

0.14239218877135884

There is one big downside to method chaining. 

None of the intermediate objects in a method chain is displayed, so if one of the methods produces an unexpected or unusable result, it will be hard to trace where in the chain a bug occurred. 

Can be better for debugging to sequentially assign method outputs to new variables. Can also be cumbersome, but it's a tradeoff you need to consider. 
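
One possible middle ground is to keep the chain but peek at intermediate results with `.pipe()`. A minimal sketch, where `log_step` is a hypothetical helper written for this example (not part of pandas) and the Series is made up:

```python
import numpy as np
import pandas as pd

def log_step(obj, label):
    """Hypothetical helper: print a short summary, then return the object unchanged."""
    print(f'{label}: dtype={obj.dtype}, size={obj.size}, missing={obj.isnull().sum()}')
    return obj

# Made-up like counts with one missing value
likes = pd.Series([1000.0, np.nan, 131.0])

# Same chain as before, with a checkpoint after every step
result = (likes
          .pipe(log_step, 'raw')
          .fillna(0)
          .pipe(log_step, 'after fillna')
          .astype(int)
          .pipe(log_step, 'after astype'))
```

Because `log_step` returns its input untouched, the checkpoints can be removed later without changing the result.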

# Really crucial skill here: reassigning indices to be meaningful
**Going to change default index, which is a series running from 0 --> n-1, to be movie title**

Can also set this at import with index_col parameter of read_csv() method

In [48]:
movie_2 = movie.set_index('movie_title')
movie_2

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
John Carter,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
Spider-Man 3,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,336530303.0,Action|Adventure|Romance,...,1902.0,English,USA,PG-13,258000000.0,2007.0,11000.0,6.2,2.35,0
Tangled,Color,Nathan Greno,324.0,100.0,15.0,284.0,Donna Murphy,799.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,...,387.0,English,USA,PG,260000000.0,2010.0,553.0,7.8,1.85,29000
Avengers: Age of Ultron,Color,Joss Whedon,635.0,141.0,0.0,19000.0,Robert Downey Jr.,26000.0,458991599.0,Action|Adventure|Sci-Fi,...,1117.0,English,USA,PG-13,250000000.0,2015.0,21000.0,7.5,2.35,118000
Harry Potter and the Half-Blood Prince,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,301956980.0,Adventure|Family|Fantasy|Mystery,...,973.0,English,UK,PG,250000000.0,2009.0,11000.0,7.5,2.35,10000


In [49]:
movie_2 = pd.read_csv('data/movie.csv', 
          index_col= 'movie_title')
movie_2

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
John Carter,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
Spider-Man 3,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,336530303.0,Action|Adventure|Romance,...,1902.0,English,USA,PG-13,258000000.0,2007.0,11000.0,6.2,2.35,0
Tangled,Color,Nathan Greno,324.0,100.0,15.0,284.0,Donna Murphy,799.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,...,387.0,English,USA,PG,260000000.0,2010.0,553.0,7.8,1.85,29000
Avengers: Age of Ultron,Color,Joss Whedon,635.0,141.0,0.0,19000.0,Robert Downey Jr.,26000.0,458991599.0,Action|Adventure|Sci-Fi,...,1117.0,English,USA,PG-13,250000000.0,2015.0,21000.0,7.5,2.35,118000
Harry Potter and the Half-Blood Prince,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,301956980.0,Adventure|Family|Fantasy|Mystery,...,973.0,English,UK,PG,250000000.0,2009.0,11000.0,7.5,2.35,10000


**To change back, use .reset_index() command to return Movie Title as a column, and reset index as a numeric series**
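
A minimal round-trip sketch on a made-up two-row DataFrame (values are illustrative only):

```python
import pandas as pd

# Made-up frame with a default 0..n-1 index
df = pd.DataFrame({'movie_title': ['A', 'B'], 'imdb_score': [7.9, 7.1]})

# Move the title column into the index
indexed = df.set_index('movie_title')

# Move it back: the index becomes a column again and a fresh 0..n-1 index is created
restored = indexed.reset_index()

# drop=True discards the old index instead of turning it into a column
dropped = indexed.reset_index(drop=True)

print(indexed)
print(restored)
print(dropped)
```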

If you want to rename all rows/columns, you can pass dictionaries into the .rename() method

In [50]:
movie = pd.read_csv('data/movie.csv', 
          index_col= 'movie_title')

In [51]:
idx_rename = {'Avatar' : 'Ratava', 'Spectre' : 'Ertceps'}
col_rename = {'director_name' : 'Director Name',
              'num_critic_for_reviews' : 'Critical_Reviews'}

In [52]:
movie_renamed = movie.rename(index = idx_rename,
                             columns = col_rename)
movie_renamed.head()

movie_title,color,Director Name,Critical_Reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Ratava,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Ertceps,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


Can reassign column/index names directly via a Python list. Short example below

In [53]:
index = movie.index
print(index.dtype)
#convert the extracted index to a list
index_list = index.tolist()
#reassign values
index_list[0] = 'Ratava'
index_list[2] = 'Ertceps'
#replace existing index
movie.index = index_list
movie

object


Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Ratava,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Ertceps,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
John Carter,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
Spider-Man 3,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,336530303.0,Action|Adventure|Romance,...,1902.0,English,USA,PG-13,258000000.0,2007.0,11000.0,6.2,2.35,0
Tangled,Color,Nathan Greno,324.0,100.0,15.0,284.0,Donna Murphy,799.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,...,387.0,English,USA,PG,260000000.0,2010.0,553.0,7.8,1.85,29000
Avengers: Age of Ultron,Color,Joss Whedon,635.0,141.0,0.0,19000.0,Robert Downey Jr.,26000.0,458991599.0,Action|Adventure|Sci-Fi,...,1117.0,English,USA,PG-13,250000000.0,2015.0,21000.0,7.5,2.35,118000
Harry Potter and the Half-Blood Prince,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,301956980.0,Adventure|Family|Fantasy|Mystery,...,973.0,English,UK,PG,250000000.0,2009.0,11000.0,7.5,2.35,10000
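
As a possible alternative to rebuilding the whole index from a list, the ```rename``` method can swap individual labels. A minimal sketch on a toy frame (the data and the names ```toy```, ```by_list``` and ```by_rename``` are made up for illustration; the replacement labels mirror the ones used in the cell above):

```python
import pandas as pd

# Toy stand-in for the movie data (values are illustrative only)
toy = pd.DataFrame({'imdb_score': [7.9, 7.1, 6.8]},
                   index=['Avatar', 'Pirates', 'Spectre'])

# Approach from the cell above: edit a plain list, then assign it back
labels = toy.index.tolist()
labels[0] = 'Ratava'
labels[2] = 'Ertceps'
by_list = toy.copy()
by_list.index = labels

# rename maps old labels to new ones and returns a new DataFrame
by_rename = toy.rename(index={'Avatar': 'Ratava', 'Spectre': 'Ertceps'})

print(by_rename.index.tolist())
```

Both routes end with the same labels; ```rename``` leaves the original frame untouched, which can be handy when experimenting.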


Going to add new columns to this data frame by using assignment, and remove old ones with the drop method

In [54]:
movie['has seen'] = 0 
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,has seen
Ratava,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,0
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0,0
Ertceps,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000,0
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000,0
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,12.0,7.1,,0,0


In [55]:
# Can perform operations on columns and assign output to new column
movie['actor_director_facebook_likes'] = \
    (movie['actor_1_facebook_likes'] + 
     movie['actor_2_facebook_likes'] +
     movie['actor_3_facebook_likes'] +
     movie['director_facebook_likes'])
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,has seen,actor_director_facebook_likes
Ratava,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,0,2791.0
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0,0,46563.0
Ertceps,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000,0,11554.0
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000,0,95000.0
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,12.0,7.1,,0,0,


In [56]:
# Let's check for null values in new column, and then fill the NaN's
movie['actor_director_facebook_likes'].isnull().sum()

122

In [57]:
movie['actor_director_facebook_likes'] = \
movie['actor_director_facebook_likes'].fillna(0)
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,has seen,actor_director_facebook_likes
Ratava,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,0,2791.0
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0,0,46563.0
Ertceps,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000,0,11554.0
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000,0,95000.0
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,12.0,7.1,,0,0,0.0
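
The 122 nulls appear because ```+``` propagates NaN: if any one of the added columns is missing for a row, the whole row sum is missing. A small sketch of that behaviour on made-up numbers, plus one alternative (summing across columns, which skips NaN by default). Whether the alternative is preferable depends on the analysis; it is not what the cells above do.

```python
import numpy as np
import pandas as pd

# Toy data; the second row is missing one of the three values
toy = pd.DataFrame({'actor_1': [1000.0, 131.0],
                    'actor_2': [936.0, 12.0],
                    'actor_3': [855.0, np.nan]})

# The + operator propagates NaN: one missing value makes the row total NaN
plus_total = toy['actor_1'] + toy['actor_2'] + toy['actor_3']

# fillna(0) afterwards replaces that NaN total with 0
filled_after = plus_total.fillna(0)

# sum across columns skips NaN by default, so known values are still counted
row_total = toy.sum(axis='columns')

print(plus_total.tolist(), filled_after.tolist(), row_total.tolist())
```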


In [58]:
import ipywidgets as widgets

In [59]:
# Going to check whether cast_total_facebook_likes is >= the actor_director_facebook_likes column we just made
movie['is_cast_likes_more'] = \
     (movie['cast_total_facebook_likes'] >=
      movie['actor_director_facebook_likes'])
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,has seen,actor_director_facebook_likes,is_cast_likes_more
Ratava,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,0,2791.0,True
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0,0,46563.0,True
Ertceps,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000,0,11554.0,True
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000,0,95000.0,True
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,12.0,7.1,,0,0,0.0,True


In [60]:
# Check whether all is_cast_likes_more values == True
movie['is_cast_likes_more'].all()

False

The ```False``` output shows that at least one movie has actor_director likes > cast likes. Let's just drop the column we made and start over.
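
When ```all()``` comes back ```False```, it can help to see which rows fail. A minimal sketch, assuming a toy frame with made-up column names ```cast_total``` and ```actor_director```, that inverts the boolean Series and uses it as a mask:

```python
import pandas as pd

# Toy data: row 'B' breaks the expectation that cast >= actor_director
toy = pd.DataFrame({'cast_total': [4834, 143, 500],
                    'actor_director': [2791.0, 274.0, 500.0]},
                   index=['A', 'B', 'C'])

is_cast_more = toy['cast_total'] >= toy['actor_director']

# all() is True only if every element is True
print(is_cast_more.all())

# ~ inverts the boolean Series; indexing with it keeps only the failing rows
offenders = toy[~is_cast_more]
print(offenders)
```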

In [61]:
movie = movie.drop('actor_director_facebook_likes',
                   axis = 'columns')
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,has seen,is_cast_likes_more
Ratava,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,0,True
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0,0,True
Ertceps,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000,0,True
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000,0,True
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,12.0,7.1,,0,0,True


In [62]:
#Now we can recreate that column with just actors
movie['actor_total_facebook_likes'] = \
     (movie['actor_1_facebook_likes'] +
      movie['actor_2_facebook_likes'] +
      movie['actor_3_facebook_likes'])
movie['actor_total_facebook_likes'] = \
     movie['actor_total_facebook_likes'].fillna(0)
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,has seen,is_cast_likes_more,actor_total_facebook_likes
Ratava,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,0,True,2791.0
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0,0,True,46000.0
Ertceps,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000,0,True,11554.0
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000,0,True,73000.0
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,12.0,7.1,,0,0,True,0.0


In [63]:
#Now going to check this updated column against cast again
movie['is_cast_likes_more'] = \
     (movie['cast_total_facebook_likes'] >=
      movie['actor_total_facebook_likes'])
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,has seen,is_cast_likes_more,actor_total_facebook_likes
Ratava,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,0,True,2791.0
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0,0,True,46000.0
Ertceps,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000,0,True,11554.0
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000,0,True,73000.0
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,12.0,7.1,,0,0,True,0.0


In [64]:
movie['is_cast_likes_more'].all()

True

In [65]:
# Now let's see what fraction of cast likes comes from the top three actors
movie['pct_actor_cast_likes'] = \
     (movie['actor_total_facebook_likes']/
      movie['cast_total_facebook_likes'])
#Validate this
(movie['pct_actor_cast_likes'].min(),
 movie['pct_actor_cast_likes'].max())

(0.0, 1.0)

In [66]:
#Finally, can output column as a series with index labelled according to movie
movie['pct_actor_cast_likes'].head()

Ratava                                        0.577369
Pirates of the Caribbean: At World's End      0.951396
Ertceps                                       0.987521
The Dark Knight Rises                         0.683783
Star Wars: Episode VII - The Force Awakens    0.000000
Name: pct_actor_cast_likes, dtype: float64

Can add a column at a specific location in the dataframe using the insert() method; the get_loc() method finds the integer position of an existing column

In [67]:
profit_index = movie.columns.get_loc('gross') + 1
profit_index

9

In [68]:
#movie.insert(loc = profit_index,
#             column = 'profit',
#             value = movie['gross'] - movie['budget'])
#Alternative to drop method

movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,has seen,is_cast_likes_more,actor_total_facebook_likes,pct_actor_cast_likes
Ratava,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,237000000.0,2009.0,936.0,7.9,1.78,33000,0,True,2791.0,0.577369
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,300000000.0,2007.0,5000.0,7.1,2.35,0,0,True,46000.0,0.951396
Ertceps,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,245000000.0,2015.0,393.0,6.8,2.35,85000,0,True,11554.0,0.987521
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,250000000.0,2012.0,23000.0,8.5,2.35,164000,0,True,73000.0,0.683783
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,12.0,7.1,,0,0,True,0.0,0.0
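
Since the ```insert``` call above is left commented out, here is a hedged, self-contained sketch of the same idea on a toy frame. The numbers are made up and the column name ```profit``` is an assumption chosen for illustration.

```python
import pandas as pd

# Toy data (values are illustrative only)
toy = pd.DataFrame({'gross': [760.0, 309.0],
                    'genres': ['Action', 'Adventure'],
                    'budget': [237.0, 300.0]})

# Integer position immediately after the 'gross' column
profit_index = toy.columns.get_loc('gross') + 1

# insert modifies the DataFrame in place and returns None
toy.insert(loc=profit_index,
           column='profit',
           value=toy['gross'] - toy['budget'])

print(toy.columns.tolist())
```

Because ```insert``` works in place, running it twice with the same column name raises an error unless ```allow_duplicates=True``` is passed.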


# Chapter 2: Essential DataFrame Operations 
**Let the fun begin!**

In this first lesson, going to subset movie dataframe 

In [69]:
#Pass the desired columns as a list inside the indexing operator
movie_actor_director = movie[['actor_1_name', 'actor_2_name',
                              'actor_3_name', 'director_name']]
movie_actor_director.head()

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name,director_name
Ratava,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
Pirates of the Caribbean: At World's End,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
Ertceps,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes
The Dark Knight Rises,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan
Star Wars: Episode VII - The Force Awakens,Doug Walker,Rob Walker,,Doug Walker


In [70]:
#Selecting just one column as a DataFrame (rather than a Series) needs double [[]]
movie[['director_name']].head()

Unnamed: 0,director_name
Ratava,James Cameron
Pirates of the Caribbean: At World's End,Gore Verbinski
Ertceps,Sam Mendes
The Dark Knight Rises,Christopher Nolan
Star Wars: Episode VII - The Force Awakens,Doug Walker


Passing a column name as a string produces a series

Passing a column name as a list produces a DataFrame

It will help readability if column names are passed into a list variable, and then passed into dataframe subset operation
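
The string-versus-list distinction can be checked directly. A minimal sketch on a toy frame (data and the variable name ```name_cols``` are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'director_name': ['James Cameron', 'Sam Mendes'],
                    'actor_1_name': ['CCH Pounder', 'Christoph Waltz'],
                    'duration': [178.0, 148.0]})

as_series = toy['director_name']     # string -> Series
as_frame = toy[['director_name']]    # one-item list -> DataFrame

# Keeping the column names in a named list keeps the selection readable
name_cols = ['director_name', 'actor_1_name']
people = toy[name_cols]

print(type(as_series).__name__, type(as_frame).__name__, people.shape)
```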

Other methods also facilitate column selection; ```select_dtypes``` and ```filter``` are two of them

In [71]:
movie = pd.read_csv('data/movie.csv',
                    index_col = 'movie_title')

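#Note: get_dtype_counts() was removed in pandas 1.0; on newer versions use movie.dtypes.value_counts()
movie.get_dtype_counts()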

float64    13
int64       3
object     11
dtype: int64

In [72]:
movie.select_dtypes(include=['int']).head()

#pass 'number' into include parameter to select all numeric data

Unnamed: 0_level_0,num_voted_users,cast_total_facebook_likes,movie_facebook_likes
movie_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Avatar,886204,4834,33000
Pirates of the Caribbean: At World's End,471220,48350,0
Spectre,275868,11700,85000
The Dark Knight Rises,1144337,106759,164000
Star Wars: Episode VII - The Force Awakens,8,143,0
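
A small sketch of the ```'number'``` shortcut mentioned in the comment above, on a toy frame with made-up values. ```exclude``` is the mirror image of ```include``` and returns everything that is not numeric.

```python
import pandas as pd

toy = pd.DataFrame({'movie_title': ['Avatar', 'Spectre'],
                    'num_voted_users': [886204, 275868],
                    'imdb_score': [7.9, 6.8]})

# 'number' covers both integer and float columns
numeric = toy.select_dtypes(include='number')

# exclude drops the matching dtypes and keeps the rest
non_numeric = toy.select_dtypes(exclude='number')

print(numeric.columns.tolist(), non_numeric.columns.tolist())
```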


The ```filter``` method is a common and flexible way to select columns. It selects column names based on several inputs. In the example below, it finds all columns with the string 'facebook' in the name

In [73]:
movie.filter(like = 'facebook').head()

Unnamed: 0_level_0,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,cast_total_facebook_likes,actor_2_facebook_likes,movie_facebook_likes
movie_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Avatar,0.0,855.0,1000.0,4834,936.0,33000
Pirates of the Caribbean: At World's End,563.0,1000.0,40000.0,48350,5000.0,0
Spectre,0.0,161.0,11000.0,11700,393.0,85000
The Dark Knight Rises,22000.0,23000.0,27000.0,106759,23000.0,164000
Star Wars: Episode VII - The Force Awakens,131.0,,131.0,143,12.0,0


Now going to use the ```regex``` parameter to perform a more nuanced operation: selecting all columns with a digit in the name

In [74]:
movie.filter(regex = r'\d').head()

Unnamed: 0_level_0,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,actor_1_name,actor_3_name,actor_2_facebook_likes
movie_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Avatar,855.0,Joel David Moore,1000.0,CCH Pounder,Wes Studi,936.0
Pirates of the Caribbean: At World's End,1000.0,Orlando Bloom,40000.0,Johnny Depp,Jack Davenport,5000.0
Spectre,161.0,Rory Kinnear,11000.0,Christoph Waltz,Stephanie Sigman,393.0
The Dark Knight Rises,23000.0,Christian Bale,27000.0,Tom Hardy,Joseph Gordon-Levitt,23000.0
Star Wars: Episode VII - The Force Awakens,,Rob Walker,131.0,Doug Walker,,12.0


Note that ```\d``` is a regex character class that matches a single digit (0-9); it has nothing to do with decimal places. By passing ```\d``` to the ```regex``` parameter, we select only those column names that contain at least one digit.

Other common regex tokens:

```$``` = anchors the match at the end of the string

e.g. ```e$``` means find column names with an e at the end

```\s``` matches a whitespace character: find column names containing a space

```\w``` matches a 'word' character (letter, digit, or underscore): like ```\d```, but it also matches letters and the underscore

Refer to https://stackoverflow.com/questions/36982512/meaning-of-regular-expressions-like-d-d-etc for more
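
The regex tokens above can be tried out with ```filter``` on a toy frame. This is only a sketch: the column names are made up, with one containing a space so that ```\s``` has something to match.

```python
import pandas as pd

# One dummy row; only the column names matter here
toy = pd.DataFrame([[1, 2, 3, 4, 5]],
                   columns=['actor_1_name', 'gross', 'movie title',
                            'budget', 'language'])

# filter applies the pattern with a search, so it can match anywhere in the name
has_digit = toy.filter(regex=r'\d').columns.tolist()
ends_with_e = toy.filter(regex=r'e$').columns.tolist()
has_space = toy.filter(regex=r'\s').columns.tolist()

# ^\w+$ requires every character to be a letter, digit, or underscore
only_word_chars = toy.filter(regex=r'^\w+$').columns.tolist()

print(has_digit, ends_with_e, has_space, only_word_chars)
```

Raw strings (```r'...'```) keep Python from interpreting the backslash before the regex engine sees it.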

# Ordering column names sensibly

Going to screen dataset, and look for discrete vs. continuous sets of data

In [75]:
movie = pd.read_csv('data/movie.csv')
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [76]:
movie.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

Grouping columns according to discrete vs. continuous now that I can see names 

In [77]:
disc_core = ['movie_title', 'title_year',
             'content_rating', 'genres']
disc_people = ['director_name', 'actor_1_name',
               'actor_2_name', 'actor_3_name']
disc_other = ['color', 'country', 'language',
              'plot_keywords', 'movie_imdb_link']
cont_fb = ['director_facebook_likes', 'actor_1_facebook_likes',
           'actor_2_facebook_likes', 'actor_3_facebook_likes',
           'cast_total_facebook_likes', 'movie_facebook_likes']
cont_finance = ['budget', 'gross']
cont_num_reviews = ['num_voted_users', 'num_user_for_reviews',
                    'num_critic_for_reviews']
cont_other = ['imdb_score', 'duration',
              'aspect_ratio', 'facenumber_in_poster']


Now going to concatenate these lists into one super-list, then compare it against the original columns using ```set()```

In [78]:
new_col_order = disc_core + disc_people + \
                disc_other + cont_fb + \
                cont_finance + cont_num_reviews + \
                cont_other

In [79]:
#Check that your list contains all the columns from the movie dataset (you didn't miss one)
set(movie.columns) == set(new_col_order)


True
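
If the set comparison ever returns ```False```, set differences show which names are missing or extra. A minimal sketch with made-up lists (```original_cols``` stands in for ```movie.columns```, and 'gross' is left out on purpose):

```python
original_cols = ['movie_title', 'budget', 'gross', 'duration']
new_col_order = ['movie_title', 'duration', 'budget']  # 'gross' forgotten on purpose

same_columns = set(original_cols) == set(new_col_order)

# Names present in the original but absent from the new order
missing = set(original_cols) - set(new_col_order)

# Names in the new order that the original does not have (e.g. typos)
extra = set(new_col_order) - set(original_cols)

# Sets ignore duplicates, so compare lengths as an extra check
no_duplicates = len(new_col_order) == len(set(new_col_order))

print(same_columns, missing, extra, no_duplicates)
```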

In [80]:
movie2 = movie[new_col_order]
movie2.head()

Unnamed: 0,movie_title,title_year,content_rating,genres,director_name,actor_1_name,actor_2_name,actor_3_name,color,country,...,movie_facebook_likes,budget,gross,num_voted_users,num_user_for_reviews,num_critic_for_reviews,imdb_score,duration,aspect_ratio,facenumber_in_poster
0,Avatar,2009.0,PG-13,Action|Adventure|Fantasy|Sci-Fi,James Cameron,CCH Pounder,Joel David Moore,Wes Studi,Color,USA,...,33000,237000000.0,760505847.0,886204,3054.0,723.0,7.9,178.0,1.78,0.0
1,Pirates of the Caribbean: At World's End,2007.0,PG-13,Action|Adventure|Fantasy,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport,Color,USA,...,0,300000000.0,309404152.0,471220,1238.0,302.0,7.1,169.0,2.35,0.0
2,Spectre,2015.0,PG-13,Action|Adventure|Thriller,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Color,UK,...,85000,245000000.0,200074175.0,275868,994.0,602.0,6.8,148.0,2.35,1.0
3,The Dark Knight Rises,2012.0,PG-13,Action|Thriller,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Color,USA,...,164000,250000000.0,448130642.0,1144337,2701.0,813.0,8.5,164.0,2.35,0.0
4,Star Wars: Episode VII - The Force Awakens,,,Documentary,Doug Walker,Doug Walker,Rob Walker,,,,...,0,,,8,,,7.1,,,0.0


# Operations on Whole DataFrame

In [81]:
movie = pd.read_csv('data/movie.csv')
movie.shape

(4916, 28)

In [82]:
movie.size

137648

In [83]:
movie.ndim

2

In [84]:
len(movie)

4916
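
These four attributes are related: ```size``` is rows times columns (4916 * 28 = 137648 above), and ```len``` returns the number of rows. A quick sketch on a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3],
                    'b': [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple

print(toy.size == n_rows * n_cols)   # size counts every cell
print(len(toy) == n_rows)            # len counts rows only
print(toy.ndim)                      # a DataFrame always has 2 dimensions
```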

Check data frame to see the number of non-missing values in each column

In [85]:
movie.count()

color                        4897
director_name                4814
num_critic_for_reviews       4867
duration                     4901
director_facebook_likes      4814
actor_3_facebook_likes       4893
actor_2_name                 4903
actor_1_facebook_likes       4909
gross                        4054
genres                       4916
actor_1_name                 4909
movie_title                  4916
num_voted_users              4916
cast_total_facebook_likes    4916
actor_3_name                 4893
facenumber_in_poster         4903
plot_keywords                4764
movie_imdb_link              4916
num_user_for_reviews         4895
language                     4904
country                      4911
content_rating               4616
budget                       4432
title_year                   4810
actor_2_facebook_likes       4903
imdb_score                   4916
aspect_ratio                 4590
movie_facebook_likes         4916
dtype: int64

In [86]:
movie.describe()

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
count,4867.0,4901.0,4814.0,4893.0,4909.0,4054.0,4916.0,4916.0,4903.0,4895.0,4432.0,4810.0,4903.0,4916.0,4590.0,4916.0
mean,137.988905,107.090798,691.014541,631.276313,6494.488491,47644510.0,82644.92,9579.815907,1.37732,267.668846,36547490.0,2002.447609,1621.923516,6.437429,2.222349,7348.294142
std,120.239379,25.286015,2832.954125,1625.874802,15106.986884,67372550.0,138322.2,18164.31699,2.023826,372.934839,100242700.0,12.453977,4011.299523,1.127802,1.40294,19206.016458
min,1.0,7.0,0.0,0.0,0.0,162.0,5.0,0.0,0.0,1.0,218.0,1916.0,0.0,1.6,1.18,0.0
25%,49.0,93.0,7.0,132.0,607.0,5019656.0,8361.75,1394.75,0.0,64.0,6000000.0,1999.0,277.0,5.8,1.85,0.0
50%,108.0,103.0,48.0,366.0,982.0,25043960.0,33132.5,3049.0,1.0,153.0,19850000.0,2005.0,593.0,6.6,2.35,159.0
75%,191.0,118.0,189.75,633.0,11000.0,61108410.0,93772.75,13616.75,2.0,320.5,43000000.0,2011.0,912.0,7.2,2.35,2000.0
max,813.0,511.0,23000.0,23000.0,640000.0,760505800.0,1689764.0,656730.0,43.0,5060.0,4200000000.0,2016.0,137000.0,9.5,16.0,349000.0


In [87]:
# Can even play with percentile parameters!
movie.describe(percentiles = [0.1, 0.3, 0.9])

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
count,4867.0,4901.0,4814.0,4893.0,4909.0,4054.0,4916.0,4916.0,4903.0,4895.0,4432.0,4810.0,4903.0,4916.0,4590.0,4916.0
mean,137.988905,107.090798,691.014541,631.276313,6494.488491,47644510.0,82644.92,9579.815907,1.37732,267.668846,36547490.0,2002.447609,1621.923516,6.437429,2.222349,7348.294142
std,120.239379,25.286015,2832.954125,1625.874802,15106.986884,67372550.0,138322.2,18164.31699,2.023826,372.934839,100242700.0,12.453977,4011.299523,1.127802,1.40294,19206.016458
min,1.0,7.0,0.0,0.0,0.0,162.0,5.0,0.0,0.0,1.0,218.0,1916.0,0.0,1.6,1.18,0.0
10%,17.0,86.0,0.0,32.0,240.0,374419.1,1593.5,509.5,0.0,21.0,1380020.0,1988.0,78.0,5.0,1.85,0.0
30%,60.0,95.0,11.0,176.0,694.0,7914069.0,11864.5,1684.5,0.0,80.0,8000000.0,2000.0,345.0,6.0,1.85,0.0
50%,108.0,103.0,48.0,366.0,982.0,25043960.0,33132.5,3049.0,1.0,153.0,19850000.0,2005.0,593.0,6.6,2.35,159.0
90%,294.0,134.0,545.0,890.8,18000.0,122902900.0,213880.5,25594.5,4.0,620.6,80000000.0,2014.0,3000.0,7.8,2.35,23000.0
max,813.0,511.0,23000.0,23000.0,640000.0,760505800.0,1689764.0,656730.0,43.0,5060.0,4200000000.0,2016.0,137000.0,9.5,16.0,349000.0


Statistical methods automatically skip over object columns that contain NaN values; in numeric columns they ignore the NaN values and still compute a result

Can invoke ```skipna = False``` to prevent silent removal of data 

In [88]:
movie.min(skipna = False)


num_critic_for_reviews       NaN
duration                     NaN
director_facebook_likes      NaN
actor_3_facebook_likes       NaN
actor_1_facebook_likes       NaN
gross                        NaN
num_voted_users              5.0
cast_total_facebook_likes    0.0
facenumber_in_poster         NaN
num_user_for_reviews         NaN
budget                       NaN
title_year                   NaN
actor_2_facebook_likes       NaN
imdb_score                   1.6
aspect_ratio                 NaN
movie_facebook_likes         0.0
dtype: float64
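
The effect of ```skipna``` is easiest to see on a single small Series. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([178.0, np.nan, 148.0])

default_min = s.min()               # NaN is skipped by default
strict_min = s.min(skipna=False)    # any NaN makes the result NaN

print(default_min, strict_min, s.count())
```

```count``` reports how many non-missing values actually went into the default result.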

# Chaining methods on a DataFrame

Going to count number of nulls in my dataframe 1) per column and 2) total

First, test isnull() function. Then run method chain.

In [89]:
movie.isnull().head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,False,True,True,False,True,False,False,True,False,...,True,True,True,True,True,True,False,False,True,False


In [90]:
movie.isnull().sum().head()

color                       19
director_name              102
num_critic_for_reviews      49
duration                    15
director_facebook_likes    102
dtype: int64

In [91]:
movie.isnull().sum().sum()

2654

In [92]:
movie.isnull().any().any()

True
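
Each link in these chains changes the shape of the result: DataFrame of booleans, then a Series with one value per column, then a single scalar. A sketch on a toy frame (made-up values) that keeps each intermediate step:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'color': ['Color', None, 'Color'],
                    'duration': [178.0, np.nan, np.nan],
                    'imdb_score': [7.9, 7.1, 6.8]})

mask = toy.isnull()              # DataFrame of True/False
per_column = mask.sum()          # Series: missing count per column
total = per_column.sum()         # scalar: missing count overall
any_missing = mask.any().any()   # scalar: is anything missing at all?

print(per_column.tolist(), total, any_missing)
```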

In [93]:
# To run stat functions on object columns with NaN's, need to fill the NaN's first
movie.select_dtypes(['object']).fillna('').min()

color                                                               
director_name                                                       
actor_2_name                                                        
genres                                                        Action
actor_1_name                                                        
movie_title                                                  #Horror
actor_3_name                                                        
plot_keywords                                                       
movie_imdb_link    http://www.imdb.com/title/tt0006864/?ref_=fn_t...
language                                                            
country                                                             
content_rating                                                      
dtype: object

# Working with operators on a DataFrame

In [94]:
college = pd.read_csv('data/college.csv')
#college + 5  #raises TypeError: cannot add a number to the string columns

In [95]:
#Need to select homogeneous data before using an operator on a DataFrame
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds = college.filter(like = 'UGDS_')
college_ugds.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [96]:
college_ugds + 0.00501

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.03831,0.94031,0.01051,0.00691,0.00741,0.00691,0.00501,0.01091,0.01881
University of Alabama at Birmingham,0.59721,0.26501,0.03331,0.05681,0.00721,0.00571,0.04181,0.02291,0.01501
Amridge University,0.30401,0.42421,0.01191,0.00841,0.00501,0.00501,0.00501,0.00501,0.27651
University of Alabama in Huntsville,0.70381,0.13051,0.04321,0.04261,0.01931,0.00521,0.02221,0.03821,0.04001
Alabama State University,0.02081,0.92581,0.01711,0.00691,0.00601,0.00561,0.01481,0.02931,0.01871
The University of Alabama,0.78751,0.11691,0.03981,0.01561,0.00881,0.00591,0.03111,0.03181,0.00761
Central Alabama Community College,0.73051,0.26631,0.00941,0.00751,0.00941,0.00501,0.00501,0.00501,0.00691
Athens State University,0.78731,0.12501,0.02411,0.01031,0.02071,0.00601,0.02241,0.01071,0.03841
Auburn University at Montgomery,0.53781,0.34261,0.01241,0.02711,0.00941,0.00661,0.03471,0.04471,0.02961
Auburn University,0.85571,0.07541,0.02981,0.02771,0.01241,0.00501,0.00501,0.01501,0.01901


In [97]:
(college_ugds + 0.00501) // 0.01

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,3.0,94.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
University of Alabama at Birmingham,59.0,26.0,3.0,5.0,0.0,0.0,4.0,2.0,1.0
Amridge University,30.0,42.0,1.0,0.0,0.0,0.0,0.0,0.0,27.0
University of Alabama in Huntsville,70.0,13.0,4.0,4.0,1.0,0.0,2.0,3.0,4.0
Alabama State University,2.0,92.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0
The University of Alabama,78.0,11.0,3.0,1.0,0.0,0.0,3.0,3.0,0.0
Central Alabama Community College,73.0,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Athens State University,78.0,12.0,2.0,1.0,2.0,0.0,2.0,1.0,3.0
Auburn University at Montgomery,53.0,34.0,1.0,2.0,0.0,0.0,3.0,4.0,2.0
Auburn University,85.0,7.0,2.0,2.0,1.0,0.0,0.0,1.0,1.0


In [98]:
college_ugds_op_round = (college_ugds + 0.00501) // 0.01 / 100
college_ugds_op_round.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.03,0.94,0.01,0.0,0.0,0.0,0.0,0.01,0.01
University of Alabama at Birmingham,0.59,0.26,0.03,0.05,0.0,0.0,0.04,0.02,0.01
Amridge University,0.3,0.42,0.01,0.0,0.0,0.0,0.0,0.0,0.27
University of Alabama in Huntsville,0.7,0.13,0.04,0.04,0.01,0.0,0.02,0.03,0.04
Alabama State University,0.02,0.92,0.01,0.0,0.0,0.0,0.01,0.02,0.01


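The ```.round()``` method does the same thing. Adding a small fraction first keeps values that sit exactly on a half from being rounded to the even side, which is how NumPy breaks ties.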

In [99]:
college_ugds_round = (college_ugds + 0.00001).round(2)
college_ugds_round.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.03,0.94,0.01,0.0,0.0,0.0,0.0,0.01,0.01
University of Alabama at Birmingham,0.59,0.26,0.03,0.05,0.0,0.0,0.04,0.02,0.01
Amridge University,0.3,0.42,0.01,0.0,0.0,0.0,0.0,0.0,0.27
University of Alabama in Huntsville,0.7,0.13,0.04,0.04,0.01,0.0,0.02,0.03,0.04
Alabama State University,0.02,0.92,0.01,0.0,0.0,0.0,0.01,0.02,0.01


In [100]:
college_ugds_round.equals(college_ugds_op_round)

True
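As a minimal sketch of why the two routes agree, the block below repeats both calculations on a toy frame. The frame `toy` and the variable names are assumptions made for illustration; the values are copied from the first two rows of the output above.

```python
import pandas as pd

# Toy frame (hypothetical name) with values taken from the output above.
toy = pd.DataFrame({'UGDS_WHITE': [0.0333, 0.5922],
                    'UGDS_BLACK': [0.9353, 0.2600],
                    'UGDS_HISP': [0.0055, 0.0283]})

# Route 1: shift up by just over half a percent, floor to whole percents, scale back.
op_round = (toy + 0.00501) // 0.01 / 100

# Route 2: nudge slightly upward so exact halves round up, then round to 2 decimals.
method_round = (toy + 0.00001).round(2)

same = method_round.equals(op_round)
print(op_round)
print(same)
```

Both routes end by dividing a whole number of percents by 100, which is why the results match exactly here rather than only approximately.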

Key Pandas insight: ```NaN``` does not equal itself! This comes from the floating-point standard that NumPy follows, and it isn't obvious at first.

By contrast, the Python object ```None``` does equal itself...

In [101]:
college_ugds.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


Best way to compare datasets to each other: the ```.equals()``` method, as opposed to ```==```

The reason not to rely on ```==``` is that NaNs never compare equal to each other, so any position holding a NaN comes back ```False``` even when the two datasets are identical
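A small sketch of the difference, using a made-up two-column frame (`left` and `right` are hypothetical names, not part of the college data):

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)   # NaN never equals itself
print(None == None)       # None does

# Two identical frames that each contain one NaN.
left = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
right = left.copy()

# Elementwise comparison marks the NaN position as False.
elementwise = left == right
all_match = elementwise.all().all()

# .equals() treats NaNs in the same location as equal.
frames_equal = left.equals(right)

print(elementwise)
print(all_match, frames_equal)
```

Note that ```.equals()``` also requires matching dtypes, so an integer column and a float column with the same values are not considered equal.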

# Applying methods along different axes

In [102]:
college_ugds.count()

UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64

In [103]:
college_ugds.count(axis = 'columns').head()

INSTNM
Alabama A & M University               9
University of Alabama at Birmingham    9
Amridge University                     9
University of Alabama in Huntsville    9
Alabama State University               9
dtype: int64

In [104]:
college_ugds.sum(axis = 'columns').head()

INSTNM
Alabama A & M University               1.0000
University of Alabama at Birmingham    0.9999
Amridge University                     1.0000
University of Alabama in Huntsville    1.0000
Alabama State University               1.0000
dtype: float64

In [105]:
college_ugds.median(axis = 'index').head()

UGDS_WHITE    0.55570
UGDS_BLACK    0.10005
UGDS_HISP     0.07140
UGDS_ASIAN    0.01290
UGDS_AIAN     0.00260
dtype: float64
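To make the direction of each ```axis``` value concrete, here is a minimal sketch on a hypothetical 3x2 frame (`toy` is not part of the college data). With ```axis = 'index'``` the rows are collapsed and one value per column comes back; with ```axis = 'columns'``` the columns are collapsed and one value per row comes back.

```python
import pandas as pd

# Hypothetical toy frame with easy-to-check sums.
toy = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]},
                   index=['r1', 'r2', 'r3'])

# Collapse the rows: result is indexed by column name.
per_column = toy.sum(axis='index')

# Collapse the columns: result is indexed by row label.
per_row = toy.sum(axis='columns')

print(per_column)
print(per_row)
```

The strings ```'index'``` and ```'columns'``` are interchangeable with the integers 0 and 1.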

# Determining college campus diversity

This recipe cleans the data of NaNs and then computes a diversity metric

In [106]:
pd.read_csv('data/college_diversity.csv', index_col = 'School')

Unnamed: 0_level_0,Diversity Index
School,Unnamed: 1_level_1
"Rutgers University--Newark Newark, NJ",0.76
"Andrews University Berrien Springs, MI",0.74
"Stanford University Stanford, CA",0.74
"University of Houston Houston, TX",0.74
"University of Nevada--Las Vegas Las Vegas, NV",0.74
"University of San Francisco San Francisco, CA",0.74
"San Francisco State University San Francisco, CA",0.73
"University of Illinois--Chicago Chicago, IL",0.73
"New Jersey Institute of Technology Newark, NJ",0.72
"Texas Woman's University Denton, TX",0.72


In [107]:
college_ugds.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [108]:
#Check for null values, sum up boolean counts per row, sort with highest number of nulls at top, and display

college_ugds.isnull() \
            .sum(axis = 1) \
            .sort_values(ascending = False) \
            .head()

INSTNM
Excel Learning Center-San Antonio South         9
Philadelphia College of Osteopathic Medicine    9
Assemblies of God Theological Seminary          9
Episcopal Divinity School                       9
Phillips Graduate Institute                     9
dtype: int64

In [109]:
college_ugds = college_ugds.dropna(how = 'all')
#If the remaining rows still had NaN's, we could follow this method with .fillna(0) (numeric 0, not the string '0')
college_ugds.isnull().sum()

UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

We now have no NaN values in the dataset
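The sketch below shows how ```how = 'all'``` differs from ```how = 'any'```, and what a follow-up ```.fillna(0)``` would do. The frame and its row labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete row, one partly missing, one fully missing.
toy = pd.DataFrame({'a': [0.5, np.nan, np.nan],
                    'b': [0.5, 0.2, np.nan]},
                   index=['full', 'partial', 'empty'])

# how='all' drops a row only when every value in it is missing.
dropped_all = toy.dropna(how='all')

# how='any' drops a row as soon as one value is missing.
dropped_any = toy.dropna(how='any')

# Fill whatever NaNs remain with a numeric zero.
filled = dropped_all.fillna(0)

print(dropped_all)
print(dropped_any)
print(filled)
```

In the college data every row was either complete or entirely missing, so ```how = 'all'``` was enough on its own.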

Going to start on the diversity calculations with the ```.ge()``` (greater than or equal to) method

In [110]:
college_ugds.ge(0.15)

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,False,True,False,False,False,False,False,False,False
University of Alabama at Birmingham,True,True,False,False,False,False,False,False,False
Amridge University,True,True,False,False,False,False,False,False,True
University of Alabama in Huntsville,True,False,False,False,False,False,False,False,False
Alabama State University,False,True,False,False,False,False,False,False,False
The University of Alabama,True,False,False,False,False,False,False,False,False
Central Alabama Community College,True,True,False,False,False,False,False,False,False
Athens State University,True,False,False,False,False,False,False,False,False
Auburn University at Montgomery,True,True,False,False,False,False,False,False,False
Auburn University,True,False,False,False,False,False,False,False,False


In [111]:
diversity_metric = college_ugds.ge(0.15).sum(axis = 'columns')
diversity_metric.head()

INSTNM
Alabama A & M University               1
University of Alabama at Birmingham    2
Amridge University                     3
University of Alabama in Huntsville    1
Alabama State University               1
dtype: int64

In [112]:
diversity_metric.value_counts()

1    3042
2    2884
3     876
4      63
0       7
5       2
dtype: int64

Two schools have 5 racial categories at or above 15%! How can that be?

In [113]:
diversity_metric.sort_values(ascending = False).head()

INSTNM
Regency Beauty Institute-Austin          5
Central Texas Beauty College-Temple      5
Sullivan and Cogliano Training Center    4
Ambria College of Nursing                4
Berkeley College-New York                4
dtype: int64

In [114]:
college_ugds.loc[['Regency Beauty Institute-Austin',
                  'Central Texas Beauty College-Temple']]

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Regency Beauty Institute-Austin,0.1867,0.2133,0.16,0.0,0.0,0.0,0.1733,0.0,0.2667
Central Texas Beauty College-Temple,0.1616,0.2323,0.2626,0.0202,0.0,0.0,0.1717,0.0,0.1515
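As a quick check of the metric, the sketch below rebuilds two rows by hand (values copied from the outputs above; the frame name `toy` is an assumption) and applies the same ```.ge(0.15).sum(axis = 'columns')``` chain.

```python
import pandas as pd

cols = ['UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
        'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN']

# Two rows copied from the outputs shown earlier.
toy = pd.DataFrame(
    [[0.0333, 0.9353, 0.0055, 0.0019, 0.0024, 0.0019, 0.0, 0.0059, 0.0138],
     [0.1867, 0.2133, 0.16, 0.0, 0.0, 0.0, 0.1733, 0.0, 0.2667]],
    index=['Alabama A & M University', 'Regency Beauty Institute-Austin'],
    columns=cols)

# True where a category holds at least 15% of undergraduates; count per school.
metric = toy.ge(0.15).sum(axis='columns')

# Share of students covered by the categories that pass the threshold.
covered = toy.where(toy.ge(0.15), 0).sum(axis='columns')

print(metric)
print(covered)
```

For Regency Beauty Institute-Austin the five qualifying categories sit between 16% and 27% each, so together they account for essentially the whole student body.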


# Data Analysis Routine #

## Beginning Data Analysis ##

The key with data analysis is to develop a routine as part of *Data Exploration*

You want a core suite of tools for visually and statistically analyzing a dataset upon first encountering it

In [115]:
college = pd.read_csv('data/college.csv')
college.head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


First thing you want: what are the dimensions of the dataset? Use the ```.shape``` attribute (no parentheses), which returns (rows, columns)

In [116]:
college.shape

(7535, 27)

The ```.info()``` method is useful for displaying metadata about each column: number of non-null values and dtype

In [119]:
college.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7535 entries, 0 to 7534
Data columns (total 27 columns):
INSTNM                7535 non-null object
CITY                  7535 non-null object
STABBR                7535 non-null object
HBCU                  7164 non-null float64
MENONLY               7164 non-null float64
WOMENONLY             7164 non-null float64
RELAFFIL              7535 non-null int64
SATVRMID              1185 non-null float64
SATMTMID              1196 non-null float64
DISTANCEONLY          7164 non-null float64
UGDS                  6874 non-null float64
UGDS_WHITE            6874 non-null float64
UGDS_BLACK            6874 non-null float64
UGDS_HISP             6874 non-null float64
UGDS_ASIAN            6874 non-null float64
UGDS_AIAN             6874 non-null float64
UGDS_NHPI             6874 non-null float64
UGDS_2MOR             6874 non-null float64
UGDS_NRA              6874 non-null float64
UGDS_UNKN             6874 non-null float64
PPTUG_EF          

Next step is to get a look at some summary statistics. What is the spread of the data? The mean? 

An efficient and readable way to do this is a method chain that combines the ```.describe()``` method with the ```.T``` attribute, which transposes the output table to show columns as rows. 

The code shown below is a really powerful way to summarize data! 

Remember that the ```.describe()``` method is flexible and allows the user to specify what is returned, e.g. the percentiles produced in the output dataframe

In [120]:
college.describe(include = [np.number]).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
HBCU,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,1.0
MENONLY,7164.0,0.009213,0.095546,0.0,0.0,0.0,0.0,1.0
WOMENONLY,7164.0,0.005304,0.072642,0.0,0.0,0.0,0.0,1.0
RELAFFIL,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,1.0
SATVRMID,1185.0,522.819409,68.578862,290.0,475.0,510.0,555.0,765.0
SATMTMID,1196.0,530.76505,73.469767,310.0,482.0,520.0,565.0,785.0
DISTANCEONLY,7164.0,0.005583,0.074519,0.0,0.0,0.0,0.0,1.0
UGDS,6874.0,2356.83794,5474.275871,0.0,117.0,412.5,1929.5,151558.0
UGDS_WHITE,6874.0,0.510207,0.286958,0.0,0.2675,0.5557,0.747875,1.0
UGDS_BLACK,6874.0,0.189997,0.224587,0.0,0.036125,0.10005,0.2577,1.0
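As a minimal sketch of that flexibility, the block below passes a custom ```percentiles``` list to ```.describe()```. The frame `toy` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one text column.
toy = pd.DataFrame({'score': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'label': list('abcde')})

# Ask for the 10th and 90th percentiles instead of the default quartiles,
# and restrict the summary to numeric columns.
summary = toy.describe(include=[np.number], percentiles=[0.1, 0.9]).T

print(summary)
```

Percentiles are given as fractions between 0 and 1 and show up as column labels such as ```10%``` and ```90%``` after the transpose.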


Do the same thing for categorical data 

In [121]:
#np.object was removed in NumPy 1.24; the builtin object selects the same columns
college.describe(include = [object, 'category']).T

Unnamed: 0,count,unique,top,freq
INSTNM,7535,7535,Florida National University-Main Campus,1
CITY,7535,2514,New York,87
STABBR,7535,59,CA,773
MD_EARN_WNE_P10,6413,598,PrivacySuppressed,822
GRAD_DEBT_MDN_SUPP,7503,2038,PrivacySuppressed,1510


## Data Dictionaries: The Data Key to A Novice User ##

Abbreviations used for column names or variables might be hard to understand. So it's a good practice to keep a *data dictionary* that records information about each column, including notes about data sources, abbreviations, changes, etc.

The example dataset has this: 

In [127]:
college_ddictionary = pd.read_csv('data/college_data_dictionary.csv')
college_ddictionary

Unnamed: 0,column_name,description
0,INSTNM,Institution Name
1,CITY,City Location
2,STABBR,State Abbreviation
3,HBCU,Historically Black College or University
4,MENONLY,0/1 Men Only
5,WOMENONLY,0/1 Women only
6,RELAFFIL,0/1 Religious Affiliation
7,SATVRMID,SAT Verbal Median
8,SATMTMID,SAT Math Median
9,DISTANCEONLY,Distance Education Only
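One convenient way to use a data dictionary is to turn it into a lookup keyed by column name. The sketch below assumes the two-column layout shown above and builds a three-row stand-in by hand (`ddict` and `lookup` are hypothetical names) so that it runs without the CSV file.

```python
import pandas as pd

# Stand-in for the first rows of the data dictionary shown above.
ddict = pd.DataFrame({
    'column_name': ['INSTNM', 'CITY', 'STABBR'],
    'description': ['Institution Name', 'City Location', 'State Abbreviation'],
})

# Index by column name so a description can be looked up directly.
lookup = ddict.set_index('column_name')['description']

print(lookup['STABBR'])

# Describe several columns at once.
print(lookup.loc[['INSTNM', 'CITY']])
```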


Pandas assigns every column a specific data type. By replacing a generic data type with a narrower one, we can reduce memory usage and make a column more useful

In [123]:
#Call college dataset like before

different_cols = ['RELAFFIL', 'SATMTMID', 'CURROPER',
                  'INSTNM', 'STABBR']

col2 = college.loc[:, different_cols]

col2.head()

Unnamed: 0,RELAFFIL,SATMTMID,CURROPER,INSTNM,STABBR
0,0,420.0,1,Alabama A & M University,AL
1,0,565.0,1,University of Alabama at Birmingham,AL
2,1,,1,Amridge University,AL
3,0,590.0,1,University of Alabama in Huntsville,AL
4,0,430.0,1,Alabama State University,AL


In [125]:
col2.dtypes

RELAFFIL      int64
SATMTMID    float64
CURROPER      int64
INSTNM       object
STABBR       object
dtype: object

In [126]:
col2.memory_usage(deep=True)

Index           80
RELAFFIL     60280
SATMTMID     60280
CURROPER     60280
INSTNM      660240
STABBR      444565
dtype: int64

At this point, notice that 
1. RELAFFIL (religious affiliation) only holds 1's and 0's (per the data dictionary), yet takes up about 60,000 bytes because each value is stored as an 8-byte int64, and 
2. STABBR (state abbreviation) could easily be reduced from the object data type to a more efficient categorical type.

We can address both issues by changing the data types

In [129]:
col2['RELAFFIL'] = col2['RELAFFIL'].astype(np.int8)

col2.dtypes

RELAFFIL       int8
SATMTMID    float64
CURROPER      int64
INSTNM       object
STABBR       object
dtype: object

In [131]:
col2.memory_usage(deep=True)

Index           80
RELAFFIL      7535
SATMTMID     60280
CURROPER     60280
INSTNM      660240
STABBR      444565
dtype: int64

In [134]:
col2['STABBR'] = col2['STABBR'].astype('category')

col2.dtypes

RELAFFIL        int8
SATMTMID     float64
CURROPER       int64
INSTNM        object
STABBR      category
dtype: object

In [136]:
#Compare new vs. old memory usage. Great job!

new_mem = col2.memory_usage(deep=True)

old_mem = college[different_cols].memory_usage(deep=True)

(new_mem / old_mem) * 100

Index       100.000000
RELAFFIL     12.500000
SATMTMID    100.000000
CURROPER    100.000000
INSTNM      100.000000
STABBR        3.053772
dtype: float64
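The same savings can be reproduced on synthetic data. The sketch below is an illustration under assumed names (`toy`, `slim`, a 0/1 `flag` column and a four-value `state` column), not the college dataset itself.

```python
import numpy as np
import pandas as pd

n = 1000  # number of rows in the toy frame

# A 0/1 flag stored as int64 and a low-cardinality text column.
toy = pd.DataFrame({
    'flag': np.tile([0, 1], n // 2).astype(np.int64),
    'state': np.tile(['AL', 'CA', 'NY', 'TX'], n // 4),
})
before = toy.memory_usage(deep=True)

slim = toy.copy()
# int8 holds -128..127, plenty for a 0/1 flag: 1 byte per row instead of 8.
slim['flag'] = slim['flag'].astype(np.int8)
# category stores each distinct string once plus a small integer code per row.
slim['state'] = slim['state'].astype('category')
after = slim.memory_usage(deep=True)

# Percentage of the original memory each column now needs.
ratio = after / before * 100
print(ratio)
```

The int8 conversion always lands at 12.5% because it trades 8 bytes per value for 1. The saving from ```category``` depends on how few distinct values the column has, so a column like INSTNM, where every value is unique, would gain nothing from it.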

## Making the Headlines: Finding the biggest or smallest of a subset of Data