# Pandas Basics

### Install and import

- Pandas is an easy package to install
- Install it using either of the following commands:
   - conda install pandas
   - pip install pandas

- In a Jupyter notebook, a leading `!` runs the rest of the line as a shell command, as if it were typed in a terminal
- `!pip install pandas`

In [1]:
#loading libraries
import numpy as np
import pandas as pd

### DataFrame Vs Series

- A Series is essentially a column
  - one-dimensional array of values with an index
- A DataFrame is a two-dimensional table made up of a collection of Series
  - Two-dimensional array of values with both a row and a column index
  
### Difference between `import pandas as pd` and `from pandas import *`

- "`import pandas as pd`" imports the pandas module and binds it to the alias `pd`
   - Hence objects within the pandas library are called using `pd.xyz`
- "`from pandas import *`" imports all public objects from the pandas module into the current namespace
   - Hence objects within the pandas library are called using only `xyz`, at the risk of shadowing other names

In [2]:
#Creating a series data structure
# "import pandas as pd"
randn = np.random.random  # note: despite the name, this draws uniform samples from [0, 1)
#s = pd.Series(randn(3),('a','b','c'))
s = pd.Series((5,6,2),('a','b','c'))
s

a    5
b    6
c    2
dtype: int64

In [3]:
round(s.mean(),2)

4.33

In [4]:
s.values

array([5, 6, 2], dtype=int64)

In [5]:
s.index

Index(['a', 'b', 'c'], dtype='object')

In [6]:
# Reindexing labels
s.reindex(['c','b','a'])

c    2
b    6
a    5
dtype: int64

In [7]:
x = s+s
x

a    10
b    12
c     4
dtype: int64

In [8]:
# Series work with Numpy
# exp(x) = exponential of x
# ln(exp(x)) = x
np.exp(s)

a    148.413159
b    403.428793
c      7.389056
dtype: float64

In [9]:
# "from pandas import *"
from pandas import *
s1 = Series(randn(3),('a','b','c'))

In [10]:
s = Series(randn(3),('a','b','c'))
d = {'one': s*s, 'two': s+s}
df = DataFrame(d)
df

Unnamed: 0,one,two
a,0.166592,0.816314
b,0.122333,0.699522
c,0.0025,0.099992


In [11]:
df.values

array([[0.16659226, 0.81631429],
       [0.12233262, 0.69952161],
       [0.00249958, 0.09999153]])

In [12]:
df.index

Index(['a', 'b', 'c'], dtype='object')

In [13]:
df.columns

Index(['one', 'two'], dtype='object')

In [14]:
# Add a third column
df['three'] = s * 3
df1 = df.copy() # duplicating the dataframe
df

Unnamed: 0,one,two,three
a,0.166592,0.816314,1.224471
b,0.122333,0.699522,1.049282
c,0.0025,0.099992,0.149987


#### Accessing columns and rows
- Columns: access by attribute
- Columns: access by bracket notation
- Rows: access by labels
- Rows: access by integer-location | selection by position

In [15]:
# Access by attribute
df.one

a    0.166592
b    0.122333
c    0.002500
Name: one, dtype: float64

In [16]:
# Access by bracket notation
df['one']

a    0.166592
b    0.122333
c    0.002500
Name: one, dtype: float64

In [17]:
# Access by labels
df.loc[['a','b']]

Unnamed: 0,one,two,three
a,0.166592,0.816314,1.224471
b,0.122333,0.699522,1.049282


In [18]:
# Access by integer-location | selection by position
df.iloc[[0,1]]

Unnamed: 0,one,two,three
a,0.166592,0.816314,1.224471
b,0.122333,0.699522,1.049282


In [19]:
df1

Unnamed: 0,one,two,three
a,0.166592,0.816314,1.224471
b,0.122333,0.699522,1.049282
c,0.0025,0.099992,0.149987


In [20]:
# With a slice object
# df.iloc[start:stop] | passing integers to extract rows
df.iloc[1:3]

Unnamed: 0,one,two,three
b,0.122333,0.699522,1.049282
c,0.0025,0.099992,0.149987


In [21]:
# With a list of integer positions (not a slice)
# passing a list
df.iloc[[1,2]]

Unnamed: 0,one,two,three
b,0.122333,0.699522,1.049282
c,0.0025,0.099992,0.149987


In [22]:
# Reindexing labels: reindex returns a new DataFrame, so df itself is unchanged
df.reindex(['c','b','a'])
df

Unnamed: 0,one,two,three
a,0.166592,0.816314,1.224471
b,0.122333,0.699522,1.049282
c,0.0025,0.099992,0.149987


In [23]:
# Drop a row by index
df.drop('c')

Unnamed: 0,one,two,three
a,0.166592,0.816314,1.224471
b,0.122333,0.699522,1.049282


In [24]:
# Drop columns
df.drop(columns=['one'])

Unnamed: 0,two,three
a,0.816314,1.224471
b,0.699522,1.049282
c,0.099992,0.149987


In [25]:
# Drop a row and a column in a single call
df.drop(index='a', columns='three')

Unnamed: 0,one,two
b,0.122333,0.699522
c,0.0025,0.099992


In [26]:
# Descriptive statistics
# Also count, sum, median, min,max, abs, prod, std, var etc. can be performed
df1.mean()

one      0.097141
two      0.538609
three    0.807914
dtype: float64

#### Beyond the basics

- Group by: split-apply-combine
- Merge, join and aggregate
- Reshaping and Pivot Tables
- Time Series / Date functionality
- Plotting with matplotlib
- Much more can be done

In [27]:
# hide warnings
import warnings
warnings.filterwarnings('ignore')

In [28]:
# to check the directory
import os
os.getcwd()

'C:\\Users\\arock.000\\Downloads'

In [29]:
# to change current directory
os.chdir("C:/Users/arock.000/Desktop/Test")
os.getcwd()

'C:\\Users\\arock.000\\Desktop\\Test'

In [30]:
# to display all the columns
pd.set_option('display.max_columns', None)

## How to read data

pandas has a `read_*` function for most common file formats, and each one loads the data straight into a DataFrame.

###  Reading data from a CSV file

In [31]:
# CSV file
purchases = pd.read_csv('purchases.csv')
purchases

Unnamed: 0,Name,Mango,Apple
0,John,7,9
1,Jim,1,2
2,Katy,3,6
3,Bob,4,4


###  Reading data from an Excel workbook with multiple sheets

In [32]:
# Excel file
sample = pd.read_excel('Sample.xlsx',sheet_name='Sheet 2')
sample.head(3)

Unnamed: 0,Name,Mango,Apple
0,Amar,9,7
1,Arun,2,1
2,Reena,6,3


### Reading data from a JSON

- Example: `df = pd.read_json('purchases.json')`
- pandas analyses the JSON structure and creates a DataFrame automatically
- However, you might need to set the `orient` keyword argument depending on the structure
- Check out this link ([JSON](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html)) for more details
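
To illustrate why `orient` matters, the sketch below parses two made-up JSON layouts. The strings and variable names are assumptions for illustration only and are not the contents of `purchases.json`.

```python
import io
import pandas as pd

# Hypothetical data; not the contents of purchases.json.
records_json = '[{"Name": "John", "Mango": 7}, {"Name": "Jim", "Mango": 1}]'
index_json = '{"John": {"Mango": 7}, "Jim": {"Mango": 1}}'

# A list of row objects corresponds to orient="records".
df_records = pd.read_json(io.StringIO(records_json), orient="records")

# A dict keyed by row label corresponds to orient="index".
df_index = pd.read_json(io.StringIO(index_json), orient="index")

print(df_records)
print(df_index)
```

In the first frame the names end up in a regular column; in the second they become the row index.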

### Reading data from a SQL database
- Install the appropriate Python library to connect to your SQL or other relational database
- Examples:
  - `pysqlite3` library for SQLite - `!pip install pysqlite3`([Link](https://pypi.org/project/pysqlite3/)); the standard library's `sqlite3` module also works without an extra install
  - `psycopg2` library for PostgreSQL - `!pip install psycopg2` ([Link](https://pypi.org/project/psycopg2/#files))
  - `mysql-connector` library for MySQL - `!pip install mysql-connector` ([Link](https://www.w3schools.com/python/python_mysql_getstarted.asp))
  - Microsoft SQL Server - see ([Microsoft Docs](https://docs.microsoft.com/en-us/sql/connect/python/python-driver-for-sql-server?view=sql-server-ver15))
  

- SQLite example: `df = pd.read_sql_query("SELECT * FROM table_name", con)`, where `con` is an open connection and `table_name` is a table in the database
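
A minimal sketch of the SQLite case, using the standard library's `sqlite3` module and an in-memory database. The table name `purchases` and its rows are made up for illustration.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a hypothetical "purchases" table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE purchases (name TEXT, mango INTEGER, apple INTEGER)")
con.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("John", 7, 9), ("Jim", 1, 2)],
)
con.commit()

# The query names a table, not the database itself.
sql_df = pd.read_sql_query("SELECT * FROM purchases", con)
con.close()

print(sql_df)
```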

In [33]:
# Converting back to a CSV, JSON, or SQL

# df.to_csv('new_filename.csv')
# df.to_json('new_filename.json')
# df.to_sql('new_table_name', con)  # con: an open database connection

In [34]:
movies = pd.read_csv("IMDB-Movie-Data.csv")
movies.head(3)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0


In [35]:
movies.tail(3)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
997,998,Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50.0
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0
999,1000,Nine Lives,"Comedy,Family,Fantasy",A stuffy businessman finds himself trapped ins...,Barry Sonnenfeld,"Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...",2016,87,5.3,12435,19.64,11.0


### Information about the data
- `.info()` provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using 

In [36]:
movies.info()
# try movies.dtypes
# try movies.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
Rank                  1000 non-null int64
Title                 1000 non-null object
Genre                 1000 non-null object
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null int64
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB


In [37]:
count_row = movies.shape[0]  # gives number of row count
count_col = movies.shape[1]  # gives number of col count
count_row, count_col

(1000, 12)

In [38]:
from IPython.display import Image

### Handling duplicates

- This data does not have duplicates
- The `drop_duplicates()` method returns a copy of the DataFrame with duplicate rows removed
- Reassigning the result to the same variable each time is a little verbose - `movies = movies.drop_duplicates()`
- For this reason, pandas has the `inplace` keyword argument on many of its methods
- Using `inplace=True` will modify the DataFrame object in place - `movies.drop_duplicates(inplace=True)` ([Learn More](https://medium.com/@jman4190/explaining-the-inplace-parameter-for-beginners-5de7ffa18d2e))
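
Since the movie data has no duplicates, the difference between the two forms is easier to see on a toy frame. The sketch below uses made-up rows; the variable names are illustrative.

```python
import pandas as pd

# Toy frame with one repeated row (hypothetical data).
dup = pd.DataFrame({"Title": ["Split", "Sing", "Split"], "Year": [2016, 2016, 2016]})

# Count the rows that repeat an earlier row.
print(dup.duplicated().sum())

# Assignment form: returns a new DataFrame and leaves `dup` untouched.
deduped = dup.drop_duplicates()

# In-place form: modifies `dup_copy` itself and returns None.
dup_copy = dup.copy()
result = dup_copy.drop_duplicates(inplace=True)
```

Because the in-place call returns `None`, assigning its result back to the same variable would discard the DataFrame.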

In [39]:
from IPython.display import Image
Image(url= "https://miro.medium.com/max/1400/1*CkYtTM7RpMBB7mlqKwTSNg.png",width=500, height=500)

### Column cleanup
- Column names often contain symbols, mixed upper and lowercase words, spaces, and typos
- Cleaning them up makes it easier to select data by column name
- Here `Revenue (Millions)` and `Runtime (Minutes)` carry extra text in parentheses

In [40]:
movies.columns

Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

In [41]:
# rename the columns

movies.rename(columns={
        'Runtime (Minutes)': 'Runtime', 
        'Revenue (Millions)': 'Revenue_millions'
    }, inplace=True)

movies.columns

Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime', 'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
      dtype='object')

In [42]:
# Helper function - missing data check
def missing_data(data):
    missing = data.isnull().sum()    # null count per column
    available = data.count()         # non-null count per column
    total = missing + available
    percent = (missing / total * 100).round(4)
    summary = pd.concat([missing, available, total, percent], axis=1,
                        keys=['Missing', 'Available', 'Total', 'Percent'])
    return summary.sort_values(['Missing'], ascending=False)

In [43]:
missing_data(movies)

Unnamed: 0,Missing,Available,Total,Percent
Revenue_millions,128,872,1000,12.8
Metascore,64,936,1000,6.4
Rank,0,1000,1000,0.0
Title,0,1000,1000,0.0
Genre,0,1000,1000,0.0
Description,0,1000,1000,0.0
Director,0,1000,1000,0.0
Actors,0,1000,1000,0.0
Year,0,1000,1000,0.0
Runtime,0,1000,1000,0.0


In [44]:
#renaming column header to upper case
movies.columns = [col.upper() for col in movies]
movies.columns

Index(['RANK', 'TITLE', 'GENRE', 'DESCRIPTION', 'DIRECTOR', 'ACTORS', 'YEAR',
       'RUNTIME', 'RATING', 'VOTES', 'REVENUE_MILLIONS', 'METASCORE'],
      dtype='object')

In [45]:
#renaming column header to lower case
movies.columns = [col.lower() for col in movies]
movies.columns

Index(['rank', 'title', 'genre', 'description', 'director', 'actors', 'year',
       'runtime', 'rating', 'votes', 'revenue_millions', 'metascore'],
      dtype='object')

In [46]:
#renaming column header to sentence case
movies.columns = movies.columns.map(lambda x: x.replace(x[0], chr(ord(x[0]) - 32), 1))
movies.columns

Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime', 'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
      dtype='object')
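
The `chr(ord(x[0]) - 32)` trick only gives the intended result when every name starts with a lower-case ASCII letter. As a hedged alternative, the sketch below upper-cases the first character with plain string slicing; the column names are made up to include the awkward cases.

```python
import pandas as pd

# Hypothetical column names, including one already capitalised
# and one that starts with a digit.
cols = pd.Index(["rank", "revenue_millions", "Title", "2nd_unit"])

# Upper-case only the first character; the rest of the name is left as is.
# Unlike subtracting 32 from the character code, this is safe when the first
# character is already upper case or is not a letter.
sentence_case = cols.map(lambda name: name[:1].upper() + name[1:])

print(list(sentence_case))
```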

### Removing null values
- Removing null values requires a good understanding of the data and its context
- Remove rows or columns only if a small share of the values are null

In [47]:
movies1 = movies.dropna()        # drops rows with any missing value
movies2 = movies.dropna(axis=1)  # drops columns with any missing value

### Imputation

- Imputation is the process of replacing missing data with substituted values
- Most commonly the **mean** or the **median** of the column is used to impute its null values
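
A small sketch of how the two choices differ, on made-up numbers with one large outlier. The column name `Revenue` and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical revenue figures with two missing values and one large outlier.
toy = pd.DataFrame({"Revenue": [10.0, 20.0, np.nan, 30.0, 900.0, np.nan]})

# fillna returns a new Series; the original column keeps its NaNs.
mean_filled = toy["Revenue"].fillna(toy["Revenue"].mean())
median_filled = toy["Revenue"].fillna(toy["Revenue"].median())

print(toy["Revenue"].mean(), toy["Revenue"].median())
```

The outlier pulls the mean far above the typical value, so on skewed columns the median is often the more conservative fill.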

In [48]:
movies['Revenue_millions'].describe()

count    872.000000
mean      82.956376
std      103.253540
min        0.000000
25%       13.270000
50%       47.985000
75%      113.715000
max      936.630000
Name: Revenue_millions, dtype: float64

In [49]:
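# assign the result back rather than calling fillna(inplace=True) on a selected column
movies['Revenue_millions'] = movies['Revenue_millions'].fillna(movies['Revenue_millions'].mean())
movies['Revenue_millions'].isnull().sum()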

0

### Data Understanding
- `describe()` on an entire DataFrame returns a summary of the distribution of each numeric column
- `describe()` on a categorical column returns the count of rows, the unique count of categories, the top category, and the frequency of the top category
- `.value_counts()` returns the frequency of each distinct value in a column

In [50]:
movies.describe()

Unnamed: 0,Rank,Year,Runtime,Rating,Votes,Revenue_millions,Metascore
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,936.0
mean,500.5,2012.783,113.172,6.7232,169808.3,82.956376,58.985043
std,288.819436,3.205962,18.810908,0.945429,188762.6,96.412043,17.194757
min,1.0,2006.0,66.0,1.9,61.0,0.0,11.0
25%,250.75,2010.0,100.0,6.2,36309.0,17.4425,47.0
50%,500.5,2014.0,111.0,6.8,110799.0,60.375,59.5
75%,750.25,2016.0,123.0,7.4,239909.8,99.1775,72.0
max,1000.0,2016.0,191.0,9.0,1791916.0,936.63,100.0


In [51]:
movies['Director'].describe()

count             1000
unique             644
top       Ridley Scott
freq                 8
Name: Director, dtype: object

In [52]:
movies['Director'].value_counts().head(10)

Ridley Scott          8
Paul W.S. Anderson    6
Michael Bay           6
David Yates           6
M. Night Shyamalan    6
Christopher Nolan     5
Antoine Fuqua         5
David Fincher         5
Danny Boyle           5
Denis Villeneuve      5
Name: Director, dtype: int64

### Extracting Columns | Subsetting | Setting Index | Slicing

 - Extracting columns as Series
 - Extracting columns as DataFrame
 - `.loc` - **loc**ates by label
 - `.iloc`- **loc**ates by **i**nteger position

In [53]:
Dir = movies['Director'] # Series
type(Dir)

pandas.core.series.Series

In [54]:
Dir_col = movies[['Director']] # DataFrame. Pass a list of column names
type(Dir_col)

pandas.core.frame.DataFrame

In [55]:
subset = movies[['Genre', 'Director']]
subset.head(5)

Unnamed: 0,Genre,Director
0,"Action,Adventure,Sci-Fi",James Gunn
1,"Adventure,Mystery,Sci-Fi",Ridley Scott
2,"Horror,Thriller",M. Night Shyamalan
3,"Animation,Comedy,Family",Christophe Lourdelet
4,"Action,Adventure,Fantasy",David Ayer


In [56]:
# setting the movie title as the index
mov = movies.copy()
mov.set_index("Title", inplace = True)
mov.head(3)

Title,Rank,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue_millions,Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0


In [57]:
guard = mov.loc["Guardians of the Galaxy"]
guard

Rank                                                                1
Genre                                         Action,Adventure,Sci-Fi
Description         A group of intergalactic criminals are forced ...
Director                                                   James Gunn
Actors              Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
Year                                                             2014
Runtime                                                           121
Rating                                                            8.1
Votes                                                          757074
Revenue_millions                                               333.13
Metascore                                                          76
Name: Guardians of the Galaxy, dtype: object

In [58]:
guard1 = mov.iloc[0]
guard1

Rank                                                                1
Genre                                         Action,Adventure,Sci-Fi
Description         A group of intergalactic criminals are forced ...
Director                                                   James Gunn
Actors              Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
Year                                                             2014
Runtime                                                           121
Rating                                                            8.1
Votes                                                          757074
Revenue_millions                                               333.13
Metascore                                                          76
Name: Guardians of the Galaxy, dtype: object

In [59]:
mov2 = mov.iloc[1:4]
mov2

Title,Rank,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue_millions,Metascore
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0


In [60]:
# Select all movies directed by Michael Bay
movies[movies['Director'] == "Michael Bay"].head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue_millions,Metascore
126,127,Transformers: Age of Extinction,"Action,Adventure,Sci-Fi",Autobots must escape sight from a bounty hunte...,Michael Bay,"Mark Wahlberg, Nicola Peltz, Jack Reynor, Stan...",2014,165,5.7,255483,245.43,32.0
168,169,13 Hours,"Action,Drama,History","During an attack on a U.S. compound in Libya, ...",Michael Bay,"John Krasinski, Pablo Schreiber, James Badge D...",2016,144,7.3,76935,52.82,48.0
212,213,Transformers,"Action,Adventure,Sci-Fi",An ancient struggle between two Cybertronian r...,Michael Bay,"Shia LaBeouf, Megan Fox, Josh Duhamel, Tyrese ...",2007,144,7.1,531112,318.76,61.0
566,567,Transformers: Dark of the Moon,"Action,Adventure,Sci-Fi",The Autobots learn of a Cybertronian spacecraf...,Michael Bay,"Shia LaBeouf, Rosie Huntington-Whiteley, Tyres...",2011,154,6.3,338369,352.36,42.0
668,669,Pain & Gain,"Comedy,Crime,Drama",A trio of bodybuilders in Florida get caught u...,Michael Bay,"Mark Wahlberg, Dwayne Johnson, Anthony Mackie,...",2013,129,6.5,168875,49.87,45.0


In [61]:
# Select all movies with rating > 8
movies[movies['Rating'] > 8].head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue_millions,Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
6,7,La La Land,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,151.06,93.0
16,17,Hacksaw Ridge,"Biography,Drama,History","WWII American Army Medic Desmond T. Doss, who ...",Mel Gibson,"Andrew Garfield, Sam Worthington, Luke Bracey,...",2016,139,8.2,211760,67.12,71.0
18,19,Lion,"Biography,Drama",A five-year-old Indian boy gets lost on the st...,Garth Davis,"Dev Patel, Nicole Kidman, Rooney Mara, Sunny P...",2016,118,8.1,102061,51.69,69.0
26,27,Bahubali: The Beginning,"Action,Adventure,Drama","In ancient India, an adventurous and daring ma...",S.S. Rajamouli,"Prabhas, Rana Daggubati, Anushka Shetty,Tamann...",2015,159,8.3,76193,6.5,


In [62]:
# movies from 2007 to 2009 with a rating > 8 and revenue below the 75th percentile
movies[
    ((movies['Year'] >= 2007) & (movies['Year'] <= 2009))
    & (movies['Rating'] > 8.0)
    & (movies['Revenue_millions'] < movies['Revenue_millions'].quantile(0.75))
]

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue_millions,Metascore
136,137,No Country for Old Men,"Crime,Drama,Thriller",Violence and mayhem ensue after a hunter stumb...,Ethan Coen,"Tommy Lee Jones, Javier Bardem, Josh Brolin, W...",2007,122,8.1,660286,74.27,91.0
197,198,Into the Wild,"Adventure,Biography,Drama","After graduating from Emory University, top st...",Sean Penn,"Emile Hirsch, Vince Vaughn, Catherine Keener, ...",2007,148,8.1,459304,18.35,73.0
299,300,There Will Be Blood,"Drama,History","A story of family, religion, hatred, oil and m...",Paul Thomas Anderson,"Daniel Day-Lewis, Paul Dano, Ciarán Hinds,Mart...",2007,158,8.1,400682,40.22,92.0
430,431,3 Idiots,"Comedy,Drama",Two friends are searching for their long lost ...,Rajkumar Hirani,"Aamir Khan, Madhavan, Mona Singh, Sharman Joshi",2009,170,8.4,238789,6.52,67.0
695,696,Hachi: A Dog's Tale,"Drama,Family",A college professor's bond with the abandoned ...,Lasse Hallström,"Richard Gere, Joan Allen, Cary-Hiroyuki Tagawa...",2009,93,8.1,177602,82.956376,61.0
742,743,El secreto de sus ojos,"Drama,Mystery,Romance",A retired legal counselor writes a novel hopin...,Juan José Campanella,"Ricardo Darín, Soledad Villamil, Pablo Rago,Ca...",2009,129,8.2,144524,20.17,80.0
991,992,Taare Zameen Par,"Drama,Family,Music",An eight-year-old boy is thought to be a lazy ...,Aamir Khan,"Darsheel Safary, Aamir Khan, Tanay Chheda, Sac...",2007,165,8.5,102697,1.2,42.0
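
The same kind of multi-condition filter can also be written with `Series.between` or `DataFrame.query`. The sketch below runs on a small made-up frame; the name `films` and its values are illustrative only.

```python
import pandas as pd

# Hypothetical data with the same column names as the movies frame.
films = pd.DataFrame({
    "Year": [2006, 2007, 2008, 2009, 2010],
    "Rating": [8.5, 8.1, 7.9, 8.4, 8.8],
    "Revenue_millions": [74.0, 70.0, 300.0, 6.5, 20.0],
})

cutoff = films["Revenue_millions"].quantile(0.75)

# Series.between is inclusive on both ends by default.
mask = (
    films["Year"].between(2007, 2009)
    & (films["Rating"] > 8.0)
    & (films["Revenue_millions"] < cutoff)
)
by_mask = films[mask]

# The same filter as a query string; @cutoff refers to the Python variable.
by_query = films.query(
    "Year >= 2007 and Year <= 2009 and Rating > 8.0 and Revenue_millions < @cutoff"
)

print(by_mask)
```

Each condition in the mask form needs its own parentheses because `&` binds more tightly than the comparison operators.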


### Derived Features
- These are features/columns created from existing variables

In [63]:
def movie_len(x):
    if x > 120:
        return "long"
    else:
        return "short"

In [64]:
# applying the above function to create a new column

movies["Movie_Duration"] = movies["Runtime"].apply(movie_len)
movies.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime,Rating,Votes,Revenue_millions,Metascore,Movie_Duration
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,long
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,long
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,short
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,short
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,long


### Group By Vs Pivot Table

In [65]:
mov_group = movies.groupby(['Movie_Duration'])
mov_group # groupby only defines the groups; nothing is computed or displayed until an aggregation is applied

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000029193EE10C8>

In [66]:
mov_group.size() # display the number of rows in each group

Movie_Duration
long     289
short    711
dtype: int64

In [67]:
# Adding another derived feature and applying the same
def movie_rat(x):
    if x >= 8:
        return "High"
    elif x >= 5:
        return "Medium"
    else:
        return "Low"

In [68]:
movies["Movie_Rating"] = movies["Rating"].apply(movie_rat)
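
As an alternative to a hand-written function with `apply`, the same binning can be sketched with `pd.cut`. The thresholds mirror `movie_rat` above; the sample ratings are made up to sit on the bin edges.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings chosen to sit on and around the bin edges.
ratings = pd.Series([4.9, 5.0, 7.9, 8.0, 9.0])

# right=False makes each bin include its left edge:
# [-inf, 5) -> Low, [5, 8) -> Medium, [8, inf) -> High
binned = pd.cut(
    ratings,
    bins=[-np.inf, 5, 8, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,
)

print(binned.tolist())
```

The result is a categorical Series, which keeps the Low/Medium/High ordering when grouping or sorting.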

In [69]:
mov_1 = movies.groupby(['Movie_Duration','Movie_Rating'])
m1 = mov_1.size()
m1

Movie_Duration  Movie_Rating
long            High             46
                Low               6
                Medium          237
short           High             32
                Low              37
                Medium          642
dtype: int64

In [70]:
# .unstack() moves the inner index level into columns, which makes the result easier to read
m2 = mov_1.size().unstack()
m2

Movie_Rating,High,Low,Medium
Movie_Duration,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
long,46,6,237
short,32,37,642


In [71]:
movies.pivot_table(columns='Movie_Rating')

Movie_Rating,High,Low,Medium
Metascore,78.109589,42.105263,58.070303
Rank,283.820513,607.093023,514.513083
Rating,8.201282,4.276744,6.711718
Revenue_millions,140.486481,69.553748,78.506962
Runtime,128.897436,98.906977,112.474403
Votes,477216.166667,40961.232558,148832.833902
Year,2011.705128,2014.255814,2012.806598


In [72]:
movies.pivot_table(index='Movie_Rating', columns='Movie_Duration', values='Revenue_millions', aggfunc='mean')

Movie_Duration,long,short
Movie_Rating,Unnamed: 1_level_1,Unnamed: 2_level_1
High,144.871008,134.183723
Low,125.160459,60.536443
Medium,118.335022,63.80408
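
To make the "Group By Vs Pivot Table" comparison concrete, the sketch below computes the same table both ways on a small made-up frame. The data is illustrative and the column names are reused from above.

```python
import pandas as pd

# Hypothetical data with the same column names as the movies frame.
toy = pd.DataFrame({
    "Movie_Rating": ["High", "High", "Low", "Low", "High"],
    "Movie_Duration": ["long", "short", "long", "short", "long"],
    "Revenue_millions": [100.0, 50.0, 20.0, 10.0, 200.0],
})

# Pivot table: rows, columns, values and aggregation in one call.
pivoted = toy.pivot_table(
    index="Movie_Rating",
    columns="Movie_Duration",
    values="Revenue_millions",
    aggfunc="mean",
)

# Group by: split on both keys, take the mean, then unstack the inner key.
grouped = (
    toy.groupby(["Movie_Rating", "Movie_Duration"])["Revenue_millions"]
    .mean()
    .unstack()
)

print(pivoted)
print(grouped)
```

Both routes give the same numbers here; `pivot_table` is the shorter spelling, while `groupby` exposes the intermediate grouped object.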


#### References
- [Pandas Cheat Sheet](http://pandas.pydata.org)
- [Best Practice](https://stackoverflow.com/questions/9916878/importing-modules-in-python-best-practice)
- [Markdown](https://www.earthdatascience.org/courses/intro-to-earth-data-science/file-formats/use-text-files/format-text-with-markdown-jupyter-notebook/)
- [Image Upload](https://stackoverflow.com/questions/32370281/how-to-embed-image-or-picture-in-jupyter-notebook-either-from-a-local-machine-o)
- [ML Imputation Techniques](https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779)
- Maik Roder presentation on Pandas
- [LearnDataScience](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/)

In [73]:
# Extras
# pandas is already imported as pd above

# Reading the dataframe
df_1 = pd.read_csv('https://media-doselect.s3.amazonaws.com/generic/XWvQjYY4LZWdxLvPWOj2pPwn/heart.csv')
print('Before Change\n\n' , df_1.columns,'\n')
# Capitalise the first character of each column name using its ASCII code.
# Lower-case ASCII letters start at 97 ('a') and upper-case letters at 65 ('A'),
# so subtracting 32 converts a lower-case first letter to upper case.
# This assumes every column name starts with a lower-case ASCII letter.
df_1.columns = df_1.columns.map(lambda x: x.replace(x[0], chr(ord(x[0]) - 32), 1))

# Printing the final columns
print('After Change\n\n' , df_1.columns)

Before Change

 Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object') 

After Change

 Index(['Age', 'Sex', 'Cp', 'Trestbps', 'Chol', 'Fbs', 'Restecg', 'Thalach',
       'Exang', 'Oldpeak', 'Slope', 'Ca', 'Thal', 'Target'],
      dtype='object')
