
# Exploring the pandas Documentation with a Sample Dataset

## Reading the dataset
For this example, we use the imdbratings dataset, which lists movies with their star rating, title, content rating, genre, duration, and actors.

In [39]:
import pandas as pd

df = pd.read_csv('http://bit.ly/imdbratings')
df.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."
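
Before going further, it can help to confirm what read_csv actually inferred. The sketch below is an offline stand-in, not part of the original notebook: it parses the five rows shown above from an in-memory string (the actors_list column is left out because its values are truncated in the output), and the names csv_text and sample are illustrative.

```python
import io

import pandas as pd

# The first five rows shown above, minus actors_list (truncated in the output).
csv_text = """star_rating,title,content_rating,genre,duration
9.3,The Shawshank Redemption,R,Crime,142
9.2,The Godfather,R,Crime,175
9.1,The Godfather: Part II,R,Crime,200
9.0,The Dark Knight,PG-13,Action,152
8.9,Pulp Fiction,R,Crime,154
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # dtype inferred for each column
```

The same two checks work unchanged on the full df loaded from the URL.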


In [40]:
df[(df['duration'] > 200) & (df['genre'].isin(['Drama', 'Crime']))]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
17,8.7,Seven Samurai,UNRATED,Drama,207,"[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K..."
78,8.4,Once Upon a Time in America,R,Crime,229,"[u'Robert De Niro', u'James Woods', u'Elizabet..."
157,8.2,Gone with the Wind,G,Drama,238,"[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit..."
476,7.8,Hamlet,PG-13,Drama,242,"[u'Kenneth Branagh', u'Julie Christie', u'Dere..."
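
The filter above combines two boolean masks with &. Each condition needs its own parentheses because & binds more tightly than >. A minimal sketch on a handful of rows copied from the outputs above; the sample frame, the named masks, and the query() variant are additions for illustration, not part of the original notebook.

```python
import pandas as pd

# A few rows copied from the outputs above (actors_list omitted).
sample = pd.DataFrame({
    "title": [
        "The Godfather: Part II",
        "The Dark Knight",
        "Seven Samurai",
        "Once Upon a Time in America",
        "12 Angry Men",
    ],
    "genre": ["Crime", "Action", "Drama", "Crime", "Drama"],
    "duration": [200, 152, 207, 229, 96],
})

# Naming each mask makes the combined condition easier to read.
long_film = sample["duration"] > 200
drama_or_crime = sample["genre"].isin(["Drama", "Crime"])
filtered = sample[long_film & drama_or_crime]

# The same filter written as a query string.
queried = sample.query("duration > 200 and genre in ['Drama', 'Crime']")

print(filtered)
```

Note that a film of exactly 200 minutes is excluded, because the comparison is strict.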


## How do I use the "axis" parameter in pandas?

## Dropping a column

In [41]:
df.drop('actors_list', axis=1).head()

Unnamed: 0,star_rating,title,content_rating,genre,duration
0,9.3,The Shawshank Redemption,R,Crime,142
1,9.2,The Godfather,R,Crime,175
2,9.1,The Godfather: Part II,R,Crime,200
3,9.0,The Dark Knight,PG-13,Action,152
4,8.9,Pulp Fiction,R,Crime,154


axis="columns" and axis=1 are interchangeable, so the following call gives the same result:

In [42]:
df.drop('actors_list', axis="columns").head()

Unnamed: 0,star_rating,title,content_rating,genre,duration
0,9.3,The Shawshank Redemption,R,Crime,142
1,9.2,The Godfather,R,Crime,175
2,9.1,The Godfather: Part II,R,Crime,200
3,9.0,The Dark Knight,PG-13,Action,152
4,8.9,Pulp Fiction,R,Crime,154


## Dropping a row

To drop a row, pass its index label together with axis=0. axis='index' and axis=0 are interchangeable.

In [43]:
df.drop(2, axis=0).head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."
5,8.9,12 Angry Men,NOT RATED,Drama,96,"[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals..."
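
In both cases drop returns a new DataFrame and leaves df untouched unless the result is assigned back. drop also accepts columns= and index= keywords, which avoid the axis argument altogether. A small sketch, using a three-row sample frame built from the rows above as a stand-in for df:

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather", "The Godfather: Part II"],
    "genre": ["Crime", "Crime", "Crime"],
    "duration": [142, 175, 200],
})

no_duration = sample.drop(columns=["duration"])  # same as drop("duration", axis=1)
no_row_2 = sample.drop(index=[2])                # same as drop(2, axis=0)

# The original frame still has every row and column.
print(sample.shape, no_duration.shape, no_row_2.shape)
```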


### Remember

*   axis=0 => row operation (axis='index' is equivalent)
*   axis=1 => column operation (axis='columns' is equivalent)
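
For aggregations the same labels apply, but the direction is easy to misread: axis=0 moves down the rows and yields one result per column, while axis=1 moves across the columns and yields one result per row. A hedged sketch on two numeric columns taken from the rows above (averaging a rating with a duration is meaningless in itself; it only shows the direction, and the names are illustrative):

```python
import pandas as pd

numbers = pd.DataFrame({"star_rating": [9.3, 9.2], "duration": [142, 175]})

per_column = numbers.mean(axis=0)  # or axis="index": one mean per column
per_row = numbers.mean(axis=1)     # or axis="columns": one mean per row

print(per_column)
print(per_row)
```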


In [44]:
df.describe()

Unnamed: 0,star_rating,duration
count,979.0,979.0
mean,7.889785,120.979571
std,0.336069,26.21801
min,7.4,64.0
25%,7.6,102.0
50%,7.8,117.0
75%,8.1,134.0
max,9.3,242.0


In [45]:
df.describe(include='all')

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
count,979.0,979,976,979,979.0,979
unique,,975,12,16,,969
top,,Dracula,R,Drama,,"[u'Daniel Radcliffe', u'Emma Watson', u'Rupert..."
freq,,2,460,278,,6
mean,7.889785,,,,120.979571,
std,0.336069,,,,26.21801,
min,7.4,,,,64.0,
25%,7.6,,,,102.0,
50%,7.8,,,,117.0,
75%,8.1,,,,134.0,
max,9.3,,,,242.0,
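
By default describe() summarizes only the numeric columns. With include='all' it adds unique, top, and freq for the text columns and leaves NaN wherever a statistic does not apply. A minimal sketch on the first five rows shown earlier; sample is an illustrative stand-in for df.

```python
import pandas as pd

sample = pd.DataFrame({
    "star_rating": [9.3, 9.2, 9.1, 9.0, 8.9],
    "content_rating": ["R", "R", "R", "PG-13", "R"],
    "genre": ["Crime", "Crime", "Crime", "Action", "Crime"],
    "duration": [142, 175, 200, 152, 154],
})

numeric_summary = sample.describe()            # numeric columns only
full_summary = sample.describe(include="all")  # numeric and text columns

# Individual statistics can be read with .loc[row_label, column_label].
top_genre = full_summary.loc["top", "genre"]
top_genre_count = full_summary.loc["freq", "genre"]

print(top_genre, top_genre_count)
```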


In [46]:
df.nunique()

star_rating        20
title             975
content_rating     12
genre              16
duration          133
actors_list       969
dtype: int64

In [47]:
df.isnull().sum()

star_rating       0
title             0
content_rating    3
genre             0
duration          0
actors_list       0
dtype: int64
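
nunique() ignores missing values by default, so the three missing content_rating entries are not counted as an extra category. The sketch below uses made-up toy values (not rows from the dataset) to show the difference, plus one possible way to fill the gaps; the 'UNKNOWN' label is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy values for illustration only; not taken from the dataset.
toy = pd.DataFrame({"content_rating": ["R", "R", None, "PG-13"]})

missing = toy["content_rating"].isnull().sum()
distinct = toy["content_rating"].nunique()                    # NaN excluded
distinct_with_nan = toy["content_rating"].nunique(dropna=False)  # NaN counted

# One option among several: replace missing ratings with a placeholder label.
filled = toy["content_rating"].fillna("UNKNOWN")

print(missing, distinct, distinct_with_nan)
```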

In [48]:
# Split the columns into key (identifier) columns and value columns.
set_df_columns = set(df.columns)
print(set_df_columns)
set_key_columns = {'title', 'genre'}
print(set_key_columns)
list_key_columns = list(set_key_columns)
set_value_columns = set_df_columns - set_key_columns
print(set_value_columns)
list_value_columns = list(set_value_columns)

# Reshape from wide to long: one row per (movie, value column) pair.
# Sets are unordered, so the column order can differ between runs.
df_melted = pd.melt(df, id_vars=list_key_columns, value_vars=list_value_columns)
df_shawshank = df_melted[df_melted.title == 'The Shawshank Redemption']
df_shawshank

{'actors_list', 'duration', 'genre', 'content_rating', 'star_rating', 'title'}
{'genre', 'title'}
{'star_rating', 'content_rating', 'actors_list', 'duration'}


Unnamed: 0,genre,title,variable,value
0,Crime,The Shawshank Redemption,star_rating,9.3
979,Crime,The Shawshank Redemption,content_rating,R
1958,Crime,The Shawshank Redemption,actors_list,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
2937,Crime,The Shawshank Redemption,duration,142
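
melt stacks the value columns on top of each other, so the long frame has one row per (original row, value column) pair: 979 rows times 4 value columns here, which is why the Shawshank rows carry the labels 0, 979, 1958, and 2937. A minimal sketch on two rows from above, using lists instead of sets so the column order is deterministic (sample, key_columns, and value_columns are illustrative names):

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather"],
    "genre": ["Crime", "Crime"],
    "star_rating": [9.3, 9.2],
    "duration": [142, 175],
})

key_columns = ["title", "genre"]
# Everything that is not a key column becomes a value column, in original order.
value_columns = [c for c in sample.columns if c not in key_columns]

long_form = sample.melt(id_vars=key_columns, value_vars=value_columns)

# Each original row appears once per value column.
shawshank = long_form[long_form["title"] == "The Shawshank Redemption"]

print(long_form)
```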


## Now let us try out the dataframe pivot() function

With pivot(), we can do the converse of what we accomplished with melt().

In [80]:
df_pivoted_shawshank = df_shawshank.pivot(index=['genre', 'title'], columns='variable', values='value')
df_pivoted_shawshank

Unnamed: 0_level_0,variable,actors_list,content_rating,duration,star_rating
genre,title,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Crime,The Shawshank Redemption,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...",R,142,9.3


Not quite. The key columns are still in the index and the columns are not in their original order, so a few more operations are needed.

In [79]:
df_pivoted_shawshank.reset_index(inplace=True)
df_pivoted_shawshank[['star_rating', 'title', 'content_rating', 'genre', 'duration', 'actors_list']]

variable,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
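
Two details remain after reset_index: the column axis is still named "variable", and because melt put strings and numbers into one value column, the pivoted columns come back with object dtype. pivot also returns the rows sorted by the index keys. A hedged sketch of a full round trip on two rows from above (passing a list to index needs pandas 1.1 or later; sample, long_form, wide, and restored are illustrative names):

```python
import pandas as pd

# Rows are listed in sorted title order so the round trip keeps the row order.
sample = pd.DataFrame({
    "title": ["The Godfather", "The Shawshank Redemption"],
    "genre": ["Crime", "Crime"],
    "content_rating": ["R", "R"],
    "star_rating": [9.2, 9.3],
    "duration": [175, 142],
})
key_columns = ["title", "genre"]

long_form = sample.melt(id_vars=key_columns)  # value_vars defaults to the rest
wide = long_form.pivot(index=key_columns, columns="variable", values="value")

restored = wide.reset_index()
restored.columns.name = None                         # drop the leftover "variable" label
restored = restored[list(sample.columns)]            # original column order
restored = restored.astype(sample.dtypes.to_dict())  # undo the object dtype

print(restored.dtypes)
```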


In [77]:
df[df['title'] == 'The Shawshank Redemption']

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."


Finally, the pivoted row matches the original one, apart from the leftover "variable" label on the column axis.
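
One caveat applies if the same round trip is run on the whole dataset: title has 975 unique values across 979 rows, so genre plus title may not identify every row uniquely, and pivot raises a ValueError when an index/column pair occurs more than once. The sketch below uses made-up toy rows (not taken from the dataset) to reproduce the failure and works around it by carrying a hypothetical row_id key through melt.

```python
import pandas as pd

# Toy rows for illustration only: "Film A" appears twice with the same genre.
toy = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B"],
    "genre": ["Drama", "Drama", "Crime"],
    "duration": [100, 120, 90],
})

long_form = toy.melt(id_vars=["title", "genre"], value_vars=["duration"])

# (title, genre, variable) is not unique here, so pivot cannot reshape.
try:
    long_form.pivot(index=["title", "genre"], columns="variable", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Carrying the original row label through melt keeps every key unique.
with_id = toy.reset_index().rename(columns={"index": "row_id"})
long_with_id = with_id.melt(id_vars=["row_id", "title", "genre"], value_vars=["duration"])
wide = long_with_id.pivot(index=["row_id", "title", "genre"], columns="variable", values="value")

print(pivot_failed, len(wide))
```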