# Extra Credit Opportunity: Using the Pandas Package to Redo Labs
- This notebook is an introduction to pandas.
- In order to receive extra credit, you will be asked to re-do certain labs.
    - The labs build their tables with the datascience package. For extra credit, you will create and modify those tables with the pandas package instead.
    - Note that this is an introduction to pandas, not a strict guide to which functions you have to use. Feel free to look up different solutions for your extra credit work!


### Import Pandas
- To start using the pandas package, import it.

In [13]:
import pandas as pd

### Pandas Dataframes
- Pandas has a special name for tables: **DataFrames**.
- A **DataFrame** is a 2-dimensional data structure that can store data of different types in columns.

In [14]:
# Creating a New Dataframe:
dataFrameExample = pd.DataFrame({
    "Name": ["Christine", "Cait", "Paris", "Marium", "Liam"], 
    "Age": [20, 21, 20, 21, 21], 
    "Sex": ["F", "F", "F", "F", "M"]
})

# Add a new column:
dataFrameExample["Favorite Food"] = ["Pasta","?","!","Icecream","Ch"]

# Another way to add a column, this time at a specific position:
# insert(position, column name, values) puts the new 'School' column at column position 1
dataFrameExample.insert(1, "School", ["Temple","Temple","Temple","Temple","uh"])

# Drop a column:
# The 'inplace=True' allows me to make the changes DIRECTLY to the dataframe instead of doing 'dataFrame = dataFrame.drop...'
dataFrameExample.drop(["Sex", "Favorite Food"], axis=1, inplace=True)

dataFrameExample

Unnamed: 0,Name,School,Age
0,Christine,Temple,20
1,Cait,Temple,21
2,Paris,Temple,20
3,Marium,Temple,21
4,Liam,uh,21
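The comment in the cell above mentions the other way to drop a column: without `inplace=True`, `drop()` hands back a new dataframe and leaves the original alone. Here is a minimal sketch of that behavior. The variable names `people` and `trimmed` are made up for this example, and the data is a shortened copy of the example dataframe.

```python
import pandas as pd

# A shortened version of the example dataframe (hypothetical variable names)
people = pd.DataFrame({
    "Name": ["Christine", "Cait", "Paris"],
    "Age": [20, 21, 20],
    "Sex": ["F", "F", "F"],
})

# Without inplace=True, drop() returns a NEW dataframe.
# 'columns=[...]' is another way of writing 'axis=1'.
trimmed = people.drop(columns=["Sex"])

# The original still has all three columns; only the copy lost 'Sex'
print(list(people.columns))
print(list(trimmed.columns))
```

Either style works for the extra credit. Assigning the result to a variable just makes it easier to keep the original around.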


### Read in a CSV file
- Read in film data from the CSV file 'imdb.csv' and store it in a **DataFrame**:

In [15]:
# Command for reading in a CSV file with the PANDAS package: 
imdbPandas = pd.read_csv('imdb.csv')

# Show the dataframe:
imdbPandas

Unnamed: 0,Votes,Rating,Title,Year,Decade
0,88355,8.4,M,1931,1930
1,132823,8.3,Singin' in the Rain,1952,1950
2,74178,8.3,All About Eve,1950,1950
3,635139,8.6,Léon,1994,1990
4,145514,8.2,The Elephant Man,1980,1980
...,...,...,...,...,...
245,1078416,8.7,Forrest Gump,1994,1990
246,31003,8.1,Le salaire de la peur,1953,1950
247,167076,8.2,3 Idiots,2009,2000
248,91689,8.1,Network,1976,1970
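If you want to experiment with `pd.read_csv()` without the 'imdb.csv' file on hand, one option is to read CSV text from a string. The sketch below assumes a three-row sample typed in by hand from the table above. The names `csv_text`, `sample`, and `titles_only` are made up for the example. It also shows the `usecols` option, which loads only the columns you ask for.

```python
import io
import pandas as pd

# A hand-typed sample of the first three rows shown above (not the real file)
csv_text = """Votes,Rating,Title,Year,Decade
88355,8.4,M,1931,1930
132823,8.3,Singin' in the Rain,1952,1950
74178,8.3,All About Eve,1950,1950
"""

# io.StringIO lets read_csv treat a string as if it were a file
sample = pd.read_csv(io.StringIO(csv_text))

# usecols keeps only the listed columns while reading
titles_only = pd.read_csv(io.StringIO(csv_text), usecols=["Title", "Year"])

print(sample.shape)
print(list(titles_only.columns))
```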


### Compare to Tables from the datascience package
- Notice how it differs from the tables we're used to in the labs: 

In [16]:
# Import the datascience package we've been using in the labs for tables:
from datascience import * 

# Command for reading in a CSV file with the datascience package:
imdbDatascience = Table.read_table("imdb.csv")

imdbDatascience

Votes,Rating,Title,Year,Decade
88355,8.4,M,1931,1930
132823,8.3,Singin' in the Rain,1952,1950
74178,8.3,All About Eve,1950,1950
635139,8.6,Léon,1994,1990
145514,8.2,The Elephant Man,1980,1980
425461,8.3,Full Metal Jacket,1987,1980
441174,8.1,Gone Girl,2014,2010
850601,8.3,Batman Begins,2005,2000
37664,8.2,Judgment at Nuremberg,1961,1960
46987,8.0,Relatos salvajes,2014,2010


### What's the difference?
- DataFrames from the pandas package and Tables from the datascience package are almost the same.
    - They'll have some functions that do the same thing with different names. 
    - **EXAMPLE FROM ABOVE:** the datascience package uses ```'Table.read_table()'``` and pandas uses ```'pd.read_csv()'``` to do the same job of reading a CSV file. 
- This doesn't mean every function in pandas has an equivalent function in the datascience package, or vice versa.
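To make the "same job, different name" idea concrete, here is a rough sketch of a few pandas operations with what I believe are their closest datascience counterparts noted in the comments. Only the pandas side actually runs. Treat the datascience names in the comments as approximate matches to double-check against your labs, not as an official mapping. The dataframe `films` is a hand-typed sample of the imdb table.

```python
import pandas as pd

# Hand-typed sample of the imdb table (hypothetical variable name 'films')
films = pd.DataFrame({
    "Votes": [88355, 132823, 74178, 635139],
    "Rating": [8.4, 8.3, 8.3, 8.6],
    "Title": ["M", "Singin' in the Rain", "All About Eve", "Léon"],
    "Year": [1931, 1952, 1950, 1994],
})

# Number of rows            (datascience: films.num_rows)
n_rows = len(films)

# Keep certain columns      (datascience: films.select("Title", "Year"))
two_cols = films[["Title", "Year"]]

# Keep certain rows         (datascience: films.where("Year", are.above(1950)))
after_1950 = films[films["Year"] > 1950]

# Sort rows, largest first  (datascience: films.sort("Votes", descending=True))
by_votes = films.sort_values("Votes", ascending=False)

# Column values as an array (datascience: films.column("Rating"))
ratings = films["Rating"].to_numpy()

print(n_rows)
print(after_1950["Title"].tolist())
```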


## Pandas Series
- Pandas also has a special name for columns in dataframes: **Series**.
- A Pandas **Series** is a column in a dataframe but you can also think of it as a special array.

In [17]:
# Here, I'm creating a series with the values 22, 36, and 34. I'm also specifying the name of the column as 'Age'.
ages = pd.Series([22, 36, 34], name="Age")

# Show what the series looks like: 
ages

0    22
1    36
2    34
Name: Age, dtype: int64

In [18]:
# Use the type command to see that it's a Series object
type(ages)

pandas.core.series.Series

In [19]:
# I can also grab a column/series from an existing dataframe. 
# From the dataframe 'imdbPandas' we created earlier, I'm going to grab the 'Rating' column and store it in a variable.
ratingColumn = imdbPandas["Rating"]

# Compare the values of ratingColumn with the column 'Rating' in the imdb dataframe. They are the same!
print(ratingColumn)

# If I check the type of the column: It's also a series 
type(ratingColumn)

0      8.4
1      8.3
2      8.3
3      8.6
4      8.2
      ... 
245    8.7
246    8.1
247    8.2
248    8.1
249    8.3
Name: Rating, Length: 250, dtype: float64


pandas.core.series.Series
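Since a Series behaves like a special array, you can do arithmetic and summaries on the whole column at once. This is a small sketch that reuses the `ages` series from above. The names `older` and `as_array` are made up for the example.

```python
import pandas as pd

ages = pd.Series([22, 36, 34], name="Age")

# Summaries work on the whole column at once
print(ages.max())
print(ages.mean())

# Arithmetic is applied to every value, just like with arrays in the labs
older = ages + 1
print(older.tolist())

# If a lab needs a plain numpy array, to_numpy() converts the series
as_array = ages.to_numpy()
print(type(as_array))
```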

## Other Useful Functions!

- Let's take a look at some other functions and code you may use to complete the extra credit.


In [20]:
# Number of rows:
print("Number of Rows: ", len(imdbPandas))

# Number of columns:
print("\nNumber of Columns: ", len(imdbPandas.columns))

# Number of rows and columns
print("Number of Rows + Columns: ", imdbPandas.shape)

# All the labels of the columns
print("DataFrame Labels: ", imdbPandas.columns)

Number of Rows:  250

Number of Columns:  5
Number of Rows + Columns:  (250, 5)
DataFrame Labels:  Index(['Votes', 'Rating', 'Title', 'Year', 'Decade'], dtype='object')


### [sort_values()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)
- Sort the rows of a dataframe by the values in one or more columns

In [21]:
# Sort the imdb dataframe to get the movies with the greatest amount of votes first
imdbPandas.sort_values("Votes", ascending=False)

Unnamed: 0,Votes,Rating,Title,Year,Decade
53,1498733,9.2,The Shawshank Redemption,1994,1990
76,1473049,8.9,The Dark Knight,2008,2000
81,1271949,8.7,Inception,2010,2010
87,1177098,8.8,Fight Club,1999,1990
224,1166532,8.9,Pulp Fiction,1994,1990
...,...,...,...,...,...
102,35983,8.1,The Best Years of Our Lives,1946,1940
28,32385,8.0,La battaglia di Algeri,1966,1960
246,31003,8.1,Le salaire de la peur,1953,1950
182,28012,8.0,Le samouraï,1967,1960
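Two details about `sort_values()` are easy to miss: it returns a new dataframe instead of changing the original, and it can sort by more than one column. The sketch below assumes a five-row sample typed in from the imdb table. The names `films`, `ranked`, and `renumbered` are made up for the example.

```python
import pandas as pd

# Hand-typed sample of the imdb table (hypothetical variable name 'films')
films = pd.DataFrame({
    "Votes": [88355, 132823, 74178, 635139, 145514],
    "Rating": [8.4, 8.3, 8.3, 8.6, 8.2],
    "Title": ["M", "Singin' in the Rain", "All About Eve", "Léon", "The Elephant Man"],
})

# Sort by Rating first; rows with the same Rating are then ordered by Votes
ranked = films.sort_values(["Rating", "Votes"], ascending=False)

# The rows keep their ORIGINAL index numbers after sorting...
print(list(ranked.index))

# ...so use reset_index(drop=True) if you want them renumbered 0, 1, 2, ...
renumbered = ranked.reset_index(drop=True)
print(renumbered["Title"].tolist())

# 'films' itself is unchanged because the result was stored in a new variable
print(list(films.index))
```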


### [head()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)
- Grab the first N rows of a dataframe



In [22]:
# Show the first 7 rows of the imdb dataframe:
imdbPandas.head(7)

Unnamed: 0,Votes,Rating,Title,Year,Decade
0,88355,8.4,M,1931,1930
1,132823,8.3,Singin' in the Rain,1952,1950
2,74178,8.3,All About Eve,1950,1950
3,635139,8.6,Léon,1994,1990
4,145514,8.2,The Elephant Man,1980,1980
5,425461,8.3,Full Metal Jacket,1987,1980
6,441174,8.1,Gone Girl,2014,2010


### [tail()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html)
- Grab the last N rows of a dataframe


In [23]:
# Show the last 10 rows of the imdb dataframe:
imdbPandas.tail(10)

Unnamed: 0,Votes,Rating,Title,Year,Decade
240,117590,8.4,The Great Dictator,1940,1940
241,85012,8.1,Strangers on a Train,1951,1950
242,476501,8.6,Cidade de Deus,2002,2000
243,268905,8.4,Citizen Kane,1941,1940
244,69988,8.1,8½,1963,1960
245,1078416,8.7,Forrest Gump,1994,1990
246,31003,8.1,Le salaire de la peur,1953,1950
247,167076,8.2,3 Idiots,2009,2000
248,91689,8.1,Network,1976,1970
249,589477,8.3,Eternal Sunshine of the Spotless Mind,2004,2000
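`head()` and `tail()` become more useful when combined with `sort_values()`, for example to get a "top N" list. Here is a minimal sketch under the assumption of a five-row sample typed in from the imdb table. The variable names are made up. `nlargest()` is a pandas shortcut that should give the same rows as sorting and then taking the head.

```python
import pandas as pd

# Hand-typed sample of the imdb table (hypothetical variable name 'films')
films = pd.DataFrame({
    "Votes": [88355, 132823, 74178, 635139, 145514],
    "Title": ["M", "Singin' in the Rain", "All About Eve", "Léon", "The Elephant Man"],
})

by_votes = films.sort_values("Votes", ascending=False)

# The 2 movies with the MOST votes: sort, then take the first rows
top_two = by_votes.head(2)

# The 2 movies with the FEWEST votes: same sort, take the last rows
bottom_two = by_votes.tail(2)

# Shortcut for the top rows by a column
top_two_shortcut = films.nlargest(2, "Votes")

print(top_two["Title"].tolist())
print(bottom_two["Title"].tolist())
```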


#### Grabbing specific columns

In [24]:
# Say I'm only interested in looking at the 'Year' and 'Title' columns
imdbPandas[["Title", "Year"]]

Unnamed: 0,Title,Year
0,M,1931
1,Singin' in the Rain,1952
2,All About Eve,1950
3,Léon,1994
4,The Elephant Man,1980
...,...,...
245,Forrest Gump,1994
246,Le salaire de la peur,1953
247,3 Idiots,2009
248,Network,1976


#### Grabbing specific rows

In [25]:
# Say I'm only interested in looking at movies released AFTER 1950

# The condition inside the selection brackets imdbPandas["Year"] > 1950 checks for which rows the 'Year' column has a value larger than 1950:
imdbPandas[imdbPandas["Year"] > 1950]

Unnamed: 0,Votes,Rating,Title,Year,Decade
1,132823,8.3,Singin' in the Rain,1952,1950
3,635139,8.6,Léon,1994,1990
4,145514,8.2,The Elephant Man,1980,1980
5,425461,8.3,Full Metal Jacket,1987,1980
6,441174,8.1,Gone Girl,2014,2010
...,...,...,...,...,...
245,1078416,8.7,Forrest Gump,1994,1990
246,31003,8.1,Le salaire de la peur,1953,1950
247,167076,8.2,3 Idiots,2009,2000
248,91689,8.1,Network,1976,1970


In [26]:
# OR say I wanted movies that were released in either 1981 or 2013

# The condition inside the selection brackets imdbPandas["Year"].isin([1981, 2013]) checks for which rows the Year column is either 1981 or 2013.
imdbPandas[imdbPandas["Year"].isin([1981, 2013])]

Unnamed: 0,Votes,Rating,Title,Year,Decade
34,359121,8.1,12 Years a Slave,2013,2010
142,624943,8.2,The Wolf of Wall Street,2013,2010
193,312516,8.0,Prisoners,2013,2010
198,258882,8.1,Rush (2013/I),2013,2010
231,151256,8.4,Das Boot,1981,1980
236,585474,8.5,Raiders of the Lost Ark,1981,1980


In [27]:
# The code below will give the same result as the one above.
# Instead of using the function isin(), two expressions are used to check for rows where the Year column is 1981 or 2013. 
# The '|' means OR
imdbPandas[(imdbPandas["Year"] == 1981) | (imdbPandas["Year"] == 2013)]


Unnamed: 0,Votes,Rating,Title,Year,Decade
34,359121,8.1,12 Years a Slave,2013,2010
142,624943,8.2,The Wolf of Wall Street,2013,2010
193,312516,8.0,Prisoners,2013,2010
198,258882,8.1,Rush (2013/I),2013,2010
231,151256,8.4,Das Boot,1981,1980
236,585474,8.5,Raiders of the Lost Ark,1981,1980
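Besides `|` for OR, conditions can be combined with `&` for AND and flipped with `~` for NOT. The sketch below assumes a five-row sample typed in from the imdb table, with made-up variable names. Note the parentheses around each condition: Python evaluates `&` and `|` before comparisons like `>`, so leaving the parentheses out usually causes an error.

```python
import pandas as pd

# Hand-typed sample of the imdb table (hypothetical variable name 'films')
films = pd.DataFrame({
    "Rating": [8.4, 8.3, 8.3, 8.6, 8.2],
    "Title": ["M", "Singin' in the Rain", "All About Eve", "Léon", "The Elephant Man"],
    "Year": [1931, 1952, 1950, 1994, 1980],
    "Decade": [1930, 1950, 1950, 1990, 1980],
})

# '&' means AND: released after 1950 AND rated at least 8.3
recent_and_high = films[(films["Year"] > 1950) & (films["Rating"] >= 8.3)]

# '~' means NOT: every movie that is NOT from the 1950s
not_fifties = films[~(films["Decade"] == 1950)]

print(recent_and_high["Title"].tolist())
print(not_fifties["Title"].tolist())
```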


## How do I select specific rows and columns from a DataFrame?
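- ```iloc``` is used to grab specific rows and columns. It is written with square brackets, like ```imdbPandas.iloc[rows, columns]```, not with parentheses.
- It is INTEGER INDEX based, so you have to specify rows/columns using their integer index/location (counting from 0).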

In [28]:
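# iloc EXAMPLE: Say I wanted to grab the 40th through 66th rows, and the first 3 columns.
# Positions start at 0 and the end of a slice is NOT included, so that is rows 39:66 and columns 0:3.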

imdbPandas.iloc[39:66, 0:3]

Unnamed: 0,Votes,Rating,Title
39,108128,8.1,Cool Hand Luke
40,525515,8.1,A Beautiful Mind
41,79615,8.5,Inside Out (2015/I)
42,309141,8.5,Dr. Strangelove or: How I Learned to Stop Worr...
43,43090,8.0,"Paris, Texas (1984)"
44,536053,8.3,Snatch.
45,433487,8.1,The Bourne Ultimatum
46,427099,8.0,X-Men: Days of Future Past
47,767224,8.6,The Silence of the Lambs
48,752122,8.5,Memento
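Because `iloc` counts from 0 and leaves out the end of a slice, it helps to try it on a table small enough to check by eye. This sketch assumes a five-row sample typed in from the imdb table. The variable names are made up for the example.

```python
import pandas as pd

# Hand-typed sample of the imdb table (hypothetical variable name 'films')
films = pd.DataFrame({
    "Votes": [88355, 132823, 74178, 635139, 145514],
    "Rating": [8.4, 8.3, 8.3, 8.6, 8.2],
    "Title": ["M", "Singin' in the Rain", "All About Eve", "Léon", "The Elephant Man"],
    "Year": [1931, 1952, 1950, 1994, 1980],
})

# Rows at positions 1 and 2 (3 is excluded), columns at positions 0 and 1
block = films.iloc[1:3, 0:2]

# A single position gives back one row as a Series
first_row = films.iloc[0]

# Negative positions count from the end, like Python lists
last_row = films.iloc[-1]

# Lists pick out exact positions: rows 0 and 4, column 2 only
picked = films.iloc[[0, 4], [2]]

print(block)
print(first_row["Title"], "/", last_row["Title"])
```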


## Extra: loc()
- ```loc``` is also used to grab specific rows and columns. Like ```iloc```, it is written with square brackets.
- It is LABEL based, so you specify rows/columns based on their labels.
- You probably won't have to use ```loc``` unless you're dealing with a dataframe that uses label-based indices.

In [29]:
# Here we're making a copy of the imdb dataframe with label-based indices

# First: I'm creating an array to make my dataframe's indices label-based: (Name of the movie + "_MOVIE")
label_index = (imdbPandas["Title"] + "_MOVIE").array

# Next: Make a deep copy of the imdbPandas dataframe (so any changes to the copy aren't reflected in the original dataframe)
imdbPandasCopy = imdbPandas.copy()

# Set the copy dataframe's indices to the array we created in the first step
imdbPandasCopy.index = label_index

imdbPandasCopy

# NOTICE below how instead of numbers for each movie's index, it shows the labels created earlier with the format "(movie name)_MOVIE"

Unnamed: 0,Votes,Rating,Title,Year,Decade
M_MOVIE,88355,8.4,M,1931,1930
Singin' in the Rain_MOVIE,132823,8.3,Singin' in the Rain,1952,1950
All About Eve_MOVIE,74178,8.3,All About Eve,1950,1950
Léon_MOVIE,635139,8.6,Léon,1994,1990
The Elephant Man_MOVIE,145514,8.2,The Elephant Man,1980,1980
...,...,...,...,...,...
Forrest Gump_MOVIE,1078416,8.7,Forrest Gump,1994,1990
Le salaire de la peur_MOVIE,31003,8.1,Le salaire de la peur,1953,1950
3 Idiots_MOVIE,167076,8.2,3 Idiots,2009,2000
Network_MOVIE,91689,8.1,Network,1976,1970


In [30]:
# Now we can grab rows and columns from a dataframe with label-based indices

# Say I wanted to grab rows from the index 'M_MOVIE' to '3 Idiots_MOVIE', and the columns ranging from 'Title' to 'Decade'
# Unlike iloc, a loc slice INCLUDES its end label
imdbPandasCopy.loc['M_MOVIE':'3 Idiots_MOVIE', 'Title':'Decade']

Unnamed: 0,Title,Year,Decade
M_MOVIE,M,1931,1930
Singin' in the Rain_MOVIE,Singin' in the Rain,1952,1950
All About Eve_MOVIE,All About Eve,1950,1950
Léon_MOVIE,Léon,1994,1990
The Elephant Man_MOVIE,The Elephant Man,1980,1980
...,...,...,...
Citizen Kane_MOVIE,Citizen Kane,1941,1940
8½_MOVIE,8½,1963,1960
Forrest Gump_MOVIE,Forrest Gump,1994,1990
Le salaire de la peur_MOVIE,Le salaire de la peur,1953,1950
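One difference between `loc` and `iloc` is worth checking on a small table: a `loc` slice includes its end label, while an `iloc` slice stops before its end position. The sketch below assumes a four-row sample typed in from the imdb table, with made-up variable names. It uses `set_index("Title")` as a shorter way to get label-based indices than the copy-and-assign approach shown earlier.

```python
import pandas as pd

# Hand-typed sample of the imdb table (hypothetical variable name 'films')
films = pd.DataFrame({
    "Votes": [88355, 132823, 74178, 635139],
    "Rating": [8.4, 8.3, 8.3, 8.6],
    "Title": ["M", "Singin' in the Rain", "All About Eve", "Léon"],
    "Year": [1931, 1952, 1950, 1994],
})

# Use the movie titles as the row labels (returns a new dataframe)
labeled = films.set_index("Title")

# loc: BOTH ends of each slice are included
by_label = labeled.loc["M":"All About Eve", "Rating":"Year"]

# iloc: the end position is excluded, so 0:3 is needed to get 3 rows
by_position = labeled.iloc[0:3, 1:3]

# Both selections contain the same rows and columns
print(by_label.equals(by_position))

# loc also accepts a True/False condition for the rows
recent_ratings = labeled.loc[labeled["Year"] > 1950, ["Rating"]]
print(recent_ratings)
```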


## What to Submit:
- Re-do labs 3-? without the use of datascience Tables. Only use pandas DataFrames!
- If there is anything you're unsure of please reach out to Professor Smith. 

## Pandas Documentation
- Part of learning how to code is learning how to read documentation!
- Everything you need to know about pandas can be found at its [documentation site](https://pandas.pydata.org/docs/index.html).
- I pulled a lot from the site's tutorials so please read further if you want to learn more: 
   - [Getting started tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html#getting-started-tutorials)