# Pandas

* Pandas is a library for analysing tabular data. It's the sort of analysis you would do in Excel or with SQL.
* NumPy is about arrays of homogeneous data; Pandas is about columns of homogeneous data arranged together in a table.
* Pandas provides objects like DataFrame and GroupBy, which expose methods that let us express complex queries in an easy-to-understand way.

> Resources :  
> [10 Minutes to Pandas Tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#min)  
> [Other tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)  
> [The User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)  
> [The API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html)  

Pandas can load the data from a CSV file into a DataFrame (basically a table) and then let you do things like:

* Calculate statistics and answer questions about the data, like:
  * What's the average, median, max, or min of each column?
  * Does column A correlate with column B?
  * What does the distribution of data in column C look like?
* Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
* Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
* Store the cleaned, transformed data back into a CSV, another file format, or a database

Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in Pandas.

In [2]:
import pandas as pd
import numpy as np

## The DataFrame object

A **Series** is essentially a column, and a **DataFrame** is a two-dimensional table made up of a collection of Series.

### Creating DataFrames from scratch

There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.

In [1]:
#dict
#Each (key, value) item in data corresponds to a column in the resulting DataFrame.
#pd.DataFrame()
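
The cell above is left as an outline. A minimal sketch of one way to fill it in is shown here; the `data` dict and its column names are made up for illustration and are not part of the demo dataset.

```python
import pandas as pd

# Hypothetical data: each key becomes a column name, each list becomes that column.
data = {
    "apples": [3, 2, 0, 1],
    "oranges": [0, 3, 7, 2],
}

purchases = pd.DataFrame(data)
print(purchases)
print(purchases.shape)  # (number of rows, number of columns)
```

All the lists must have the same length, since each one supplies a value for every row.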

In [None]:
# By default the index of this DataFrame is just the row numbers 0, 1, 2, ...
# but we can also supply our own labels when we create the DataFrame.
#pd.DataFrame(df, index=[])

In [None]:
#To read a particular row using index/name
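
The two outlined cells above could be completed along these lines. This is only a sketch: the customer names used as row labels are invented for the example.

```python
import pandas as pd

data = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}

# Hypothetical row labels, passed through the index argument.
purchases = pd.DataFrame(data, index=["June", "Robert", "Lily", "David"])

# .loc looks a row up by its label and returns that row as a Series.
june = purchases.loc["June"]
print(june)
```

The Series returned for a single row is indexed by the column names, so `june["apples"]` gives that customer's apple count.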

In [6]:
# Let's read a CSV and create a DataFrame object from it
# And then let us examine some of its properties
with open("pandas_demo.csv","r") as fi:
    df=pd.read_csv(fi)

In [91]:
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,first_name,last_name,email,gender,employer,designation,university,personal_phone,bank,birth_year,years_employed
0,Otha,Gjerde,ogjerde0@biglobe.ne.jp,Female,Zomato,,Sang Myung University,132-421-9138,HDFC,1970,24.0
1,Yance,Barnsdale,ybarnsdale1@reference.com,Male,Goldman Sachs,Social Worker,Fachhochschule Ravensburg-Weingarten,492-125-1131,SBI,1960,32.0
2,Lazarus,MacNish,lmacnish2@altervista.org,Male,TCS,Administrative Officer,"California State University, Channel Islands",235-492-4252,ICICI,1970,24.0
3,Monroe,Gwinnel,mgwinnel3@booking.com,Male,Goldman Sachs,Technical Writer,St. Mary-of-the-Woods College,,ICICI,1982,14.4
4,Gaspard,Gullivent,ggullivent4@wikia.com,Male,Infosys,Clinical Specialist,"Islamic Azad University, Quchan",232-527-4054,ICICI,1998,1.6
...,...,...,...,...,...,...,...,...,...,...,...
995,Eduino,Alessandretti,ealessandrettirn@topsy.com,Male,Infosys,Structural Engineer,Universidad Adolfo Ibáñez,354-283-0381,ICICI,1978,17.6
996,Vivianna,Rix,vrixro@sun.com,Female,Flipkart,Quality Engineer,Stonehill College,785-322-0735,Yes Bank,1990,8.0
997,Donaugh,Emmert,demmertrp@sohu.com,Male,Google,Recruiting Manager,University of the South Pacific Centre,585-826-1040,HDFC,1962,30.4
998,Reinaldo,O'Scannill,roscannillrq@desdev.cn,Male,Flipkart,VP Quality Control,Bulacan State University,268-203-9762,ICICI,1990,8.0


In [4]:
# We can also use one of the columns as the index while reading
df_col = pd.read_csv('pandas_demo.csv', index_col=0)
# or set the index after reading the dataset
#df_col = df_col.set_index('last_name')
df_col

Unnamed: 0_level_0,email,gender,employer,designation,university,personal_phone,bank,birth_year,years_employed
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Gjerde,ogjerde0@biglobe.ne.jp,Female,Zomato,,Sang Myung University,132-421-9138,HDFC,1970,24.0
Barnsdale,ybarnsdale1@reference.com,Male,Goldman Sachs,Social Worker,Fachhochschule Ravensburg-Weingarten,492-125-1131,SBI,1960,32.0
MacNish,lmacnish2@altervista.org,Male,TCS,Administrative Officer,"California State University, Channel Islands",235-492-4252,ICICI,1970,24.0
Gwinnel,mgwinnel3@booking.com,Male,Goldman Sachs,Technical Writer,St. Mary-of-the-Woods College,,ICICI,1982,14.4
Gullivent,ggullivent4@wikia.com,Male,Infosys,Clinical Specialist,"Islamic Azad University, Quchan",232-527-4054,ICICI,1998,1.6
...,...,...,...,...,...,...,...,...,...
Alessandretti,ealessandrettirn@topsy.com,Male,Infosys,Structural Engineer,Universidad Adolfo Ibáñez,354-283-0381,ICICI,1978,17.6
Rix,vrixro@sun.com,Female,Flipkart,Quality Engineer,Stonehill College,785-322-0735,Yes Bank,1990,8.0
Emmert,demmertrp@sohu.com,Male,Google,Recruiting Manager,University of the South Pacific Centre,585-826-1040,HDFC,1962,30.4
O'Scannill,roscannillrq@desdev.cn,Male,Flipkart,VP Quality Control,Bulacan State University,268-203-9762,ICICI,1990,8.0


In [None]:
# pandas can also read JSON files
#pd.read_json()

In [None]:
# Convert back to CSV/JSON using the DataFrame's own methods
#df.to_csv(), df.to_json()
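
As a rough illustration of the reading and writing functions mentioned above, the sketch below round-trips a tiny, made-up table through CSV and JSON text. It uses in-memory buffers instead of files so that it runs anywhere; the `people` frame and its values are assumptions for the example.

```python
import io
import pandas as pd

# Hypothetical two-row table.
people = pd.DataFrame({"name": ["Asha", "Ben"], "birth_year": [1990, 1985]})

# With no path argument, to_csv() and to_json() return the text instead of writing a file.
csv_text = people.to_csv(index=False)          # index=False leaves the row labels out
json_text = people.to_json(orient="records")   # one JSON object per row

# Read the text back; a file path would work in place of the StringIO buffers.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Writing with `index=False` is one way to avoid an extra `Unnamed: 0` column, which typically appears when a CSV that was saved with its index is read back without `index_col=0`.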

In [20]:
# head()
df.head(6)

Unnamed: 0,first_name,last_name,email,gender,employer,designation,university,personal_phone,credit_card,birth_year,years_employed
0,Merle,Kenelin,mkenelin0@tamu.edu,Male,Aivee,Dental Hygienist,"Concordia University, Irvine",158-563-8593,visa-electron,1996,3.2
1,Marigold,Haddock,mhaddock1@illinois.edu,Female,Realpoint,Technical Writer,Universidad Michoacana de San Nicolás de Hidalgo,209-273-8218,jcb,1997,2.4
2,Nessy,Restorick,nrestorick2@com.com,Female,Npath,Engineer IV,,530-775-4072,visa-electron,1977,18.4
3,Madella,Lantuffe,mlantuffe3@comcast.net,Female,Thoughtbridge,Analog Circuit Design manager,"California State University, Fresno",,,1982,14.4
4,Bax,Chaudron,bchaudron4@reddit.com,Male,Shuffletag,VP Product Management,Institute of Management and Business Technology,824-449-7742,jcb,1981,15.2
5,Yehudit,Schroder,yschroder5@tuttocitta.it,Male,Browsetype,Senior Sales Associate,Universidad de Cienfuegos,530-514-8363,jcb,1991,7.2


In [21]:
df.tail(3)

Unnamed: 0,first_name,last_name,email,gender,employer,designation,university,personal_phone,credit_card,birth_year,years_employed
997,Tana,Weatherhogg,tweatherhoggrp@flickr.com,Female,Tekfly,Actuary,Universitas Wijaya Kusuma Surabaya,,,1981,15.2
998,Mikaela,Raycroft,mraycroftrq@kickstarter.com,Female,Thoughtblab,Developer II,"University of Maine, Augusta",276-706-4994,diners-club-enroute,1994,4.8
999,Jeanette,Korpal,jkorpalrr@dropbox.com,Female,Agimba,Senior Developer,"Universidad Argentina ""John F. Kennedy""",,,1988,9.6


In [22]:
df.columns

Index(['first_name', 'last_name', 'email', 'gender', 'employer', 'designation',
       'university', 'personal_phone', 'credit_card', 'birth_year',
       'years_employed'],
      dtype='object')

In [23]:
df.dtypes

first_name         object
last_name          object
email              object
gender             object
employer           object
designation        object
university         object
personal_phone     object
credit_card        object
birth_year          int64
years_employed    float64
dtype: object

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   first_name      1000 non-null   object 
 1   last_name       1000 non-null   object 
 2   email           1000 non-null   object 
 3   gender          1000 non-null   object 
 4   employer        1000 non-null   object 
 5   designation     939 non-null    object 
 6   university      892 non-null    object 
 7   personal_phone  783 non-null    object 
 8   bank            1000 non-null   object 
 9   birth_year      1000 non-null   int64  
 10  years_employed  1000 non-null   float64
dtypes: float64(1), int64(1), object(9)
memory usage: 86.1+ KB


In [25]:
df.describe()

Unnamed: 0,birth_year,years_employed
count,1000.0,1000.0
mean,1978.807,16.9544
std,11.341887,9.073509
min,1960.0,1.6
25%,1969.0,8.8
50%,1979.0,16.8
75%,1989.0,24.8
max,1998.0,32.0


In [26]:
df.index

RangeIndex(start=0, stop=1000, step=1)

In [27]:
df.values

array([['Merle', 'Kenelin', 'mkenelin0@tamu.edu', ..., 'visa-electron',
        1996, 3.2],
       ['Marigold', 'Haddock', 'mhaddock1@illinois.edu', ..., 'jcb',
        1997, 2.4],
       ['Nessy', 'Restorick', 'nrestorick2@com.com', ...,
        'visa-electron', 1977, 18.4],
       ...,
       ['Tana', 'Weatherhogg', 'tweatherhoggrp@flickr.com', ..., nan,
        1981, 15.2],
       ['Mikaela', 'Raycroft', 'mraycroftrq@kickstarter.com', ...,
        'diners-club-enroute', 1994, 4.8],
       ['Jeanette', 'Korpal', 'jkorpalrr@dropbox.com', ..., nan, 1988,
        9.6]], dtype=object)

In [None]:
df.shape

#### Handling Duplicates

In [None]:
# Let's create a DataFrame that contains duplicate rows

In [None]:
# Drop duplicate

In [None]:
# How to drop duplicates? 
# Argument: keep
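
The three outlined cells above might be filled in as follows. The `staff` frame is a made-up stand-in for the real data; stacking it on itself is just one convenient way to produce duplicates.

```python
import pandas as pd

# Hypothetical two-row table.
staff = pd.DataFrame({"name": ["Asha", "Ben"], "bank": ["SBI", "HDFC"]})

# Stack the frame on top of itself so that every row appears twice.
doubled = pd.concat([staff, staff], ignore_index=True)

first_kept = doubled.drop_duplicates()             # keep="first" is the default
last_kept = doubled.drop_duplicates(keep="last")   # keep the last occurrence instead
none_kept = doubled.drop_duplicates(keep=False)    # drop every row that has a duplicate

print(first_kept.index.tolist())
print(last_kept.index.tolist())
print(len(none_kept))
```

`drop_duplicates()` returns a new DataFrame and leaves `doubled` unchanged, so the result has to be assigned to a name to be kept.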

In [274]:
# You can create the dataframe object using the pd.DataFrame() constructor
# Here is one way

a=pd.DataFrame(index=range(0,10),columns=["A","B","C"],data=np.random.randint(1,10,(10,3)))
display(a)

Unnamed: 0,A,B,C
0,1,1,2
1,4,6,2
2,5,3,8
3,4,8,6
4,4,9,5
5,4,4,4
6,2,5,1
7,9,8,5
8,9,4,3
9,2,7,9


In [None]:
# Rename a column
#a.rename(columns={"old_name": "new_name"}, inplace=True)

In [None]:
#a.columns = 
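
A small sketch of the two renaming approaches outlined above, using a fixed two-row frame in place of the random one so that the result is predictable. The new column names are arbitrary.

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 4], "B": [1, 6], "C": [2, 2]})

# rename() returns a new frame; only the columns named in the mapping change.
renamed = a.rename(columns={"A": "alpha"})

# Assigning to .columns replaces every label at once; the list length must match.
relabelled = a.copy()
relabelled.columns = ["x", "y", "z"]

print(renamed.columns.tolist())
print(relabelled.columns.tolist())
```

`rename` is usually the safer choice when only a few columns change, because it does not depend on the column order.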

## Reading data

### The main interface

In [29]:
# Columns can be accessed using dot notation
df.birth_year

0      1996
1      1997
2      1977
3      1982
4      1981
       ... 
995    1975
996    1977
997    1981
998    1994
999    1988
Name: birth_year, Length: 1000, dtype: int64

In [11]:
# Columns can also be indexed - This will return a series
df["birth_year"]

0      1970
1      1960
2      1970
3      1982
4      1998
       ... 
995    1978
996    1990
997    1962
998    1990
999    1978
Name: birth_year, Length: 1000, dtype: int64

In [38]:
# Multiple columns can also be indexed - This will return a DataFrame
df[["first_name","birth_year"]]

Unnamed: 0,first_name,birth_year
0,Merle,1996
1,Marigold,1997
2,Nessy,1977
3,Madella,1982
4,Bax,1981
...,...,...
995,Christye,1975
996,Beltran,1977
997,Tana,1981
998,Mikaela,1994


In [54]:
# You can slice the rows by position
sliced=df[50:100]
# Note that the index still keeps the old labels
# The index is a set of labels in its own right, not just a top-to-bottom numbering
# In fact the index can hold strings, dates, or anything else
# Though for simplicity we will stick to integer indexes
print(sliced.index)
sliced.head()

RangeIndex(start=50, stop=100, step=1)


Unnamed: 0,first_name,last_name,email,gender,employer,designation,university,personal_phone,credit_card,birth_year,years_employed
50,Ceciley,Breakey,cbreakey1e@slideshare.net,Female,Camimbo,Systems Administrator IV,Instituto Superior de Relações Internacionais ...,269-104-0079,jcb,1985,12.0
51,Walliw,Ley,wley1f@java.com,Female,Photofeed,Human Resources Manager,Massachusetts Institute of Technology,363-227-4335,jcb,1985,12.0
52,Tod,McInteer,tmcinteer1g@shareasale.com,Male,Tambee,Software Consultant,,727-221-1362,laser,1980,16.0
53,Charmine,Hearon,chearon1h@harvard.edu,Female,Topiclounge,Chief Design Engineer,Manchester College,478-235-7826,laser,1984,12.8
54,Rikki,Verbrugge,rverbrugge1i@usnews.com,Male,Geba,Systems Administrator I,State Pedagogical University in Kryvyi Rih,111-278-6472,bankcard,1982,14.4


In [53]:
# You can chain the two indexing styles
df[:50][["first_name","last_name"]].head(5)

Unnamed: 0,first_name,last_name
0,Merle,Kenelin
1,Marigold,Haddock
2,Nessy,Restorick
3,Madella,Lantuffe
4,Bax,Chaudron


To summarise, indexing the dataframe directly like ``df[]`` allows
* indexing a single column by name (or label)
* indexing a list of columns by name 
* slicing of rows by position.

Note that the following things do not work:
* indexing columns by position
* indexing/slicing rows by name/label

As we will see, there are two other interfaces that give us this functionality.
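
The summary above can be checked on a tiny frame. This is a sketch with three made-up rows; the point is only to show which forms of `df[]` succeed and that a bare integer is treated as a column label rather than a position.

```python
import pandas as pd

df_small = pd.DataFrame(
    {"first_name": ["Otha", "Yance", "Lazarus"], "birth_year": [1970, 1960, 1970]}
)

one_col = df_small["birth_year"]                    # single name -> Series
two_cols = df_small[["first_name", "birth_year"]]   # list of names -> DataFrame
first_two = df_small[0:2]                           # slice -> rows at positions 0 and 1

# A bare integer is looked up among the column labels, so this fails here.
try:
    df_small[0]
    position_lookup_failed = False
except KeyError:
    position_lookup_failed = True

print(type(one_col).__name__, type(two_cols).__name__, len(first_two))
print(position_lookup_failed)
```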

### .loc

In [13]:
# Index a column by name, same as before
# Notice that you must now also provide a slice for the rows
df.loc[:, "first_name"]
# A list of names returns a DataFrame instead of a Series
#df.loc[:,["first_name","last_name"]]

0           Otha
1          Yance
2        Lazarus
3         Monroe
4        Gaspard
         ...    
995       Eduino
996     Vivianna
997      Donaugh
998     Reinaldo
999    Justinian
Name: first_name, Length: 1000, dtype: object

In [59]:
# But let's look at rows
sliced=df[50:100]
# Note the index goes from 50 to 99
sliced.head()

Unnamed: 0,first_name,last_name,email,gender,employer,designation,university,personal_phone,credit_card,birth_year,years_employed
50,Ceciley,Breakey,cbreakey1e@slideshare.net,Female,Camimbo,Systems Administrator IV,Instituto Superior de Relações Internacionais ...,269-104-0079,jcb,1985,12.0
51,Walliw,Ley,wley1f@java.com,Female,Photofeed,Human Resources Manager,Massachusetts Institute of Technology,363-227-4335,jcb,1985,12.0
52,Tod,McInteer,tmcinteer1g@shareasale.com,Male,Tambee,Software Consultant,,727-221-1362,laser,1980,16.0
53,Charmine,Hearon,chearon1h@harvard.edu,Female,Topiclounge,Chief Design Engineer,Manchester College,478-235-7826,laser,1984,12.8
54,Rikki,Verbrugge,rverbrugge1i@usnews.com,Male,Geba,Systems Administrator I,State Pedagogical University in Kryvyi Rih,111-278-6472,bankcard,1982,14.4


In [60]:
# We can slice by position with the main interface, for example 0:5 returns the rows labelled 50-54
sliced[0:5]

Unnamed: 0,first_name,last_name,email,gender,employer,designation,university,personal_phone,credit_card,birth_year,years_employed
50,Ceciley,Breakey,cbreakey1e@slideshare.net,Female,Camimbo,Systems Administrator IV,Instituto Superior de Relações Internacionais ...,269-104-0079,jcb,1985,12.0
51,Walliw,Ley,wley1f@java.com,Female,Photofeed,Human Resources Manager,Massachusetts Institute of Technology,363-227-4335,jcb,1985,12.0
52,Tod,McInteer,tmcinteer1g@shareasale.com,Male,Tambee,Software Consultant,,727-221-1362,laser,1980,16.0
53,Charmine,Hearon,chearon1h@harvard.edu,Female,Topiclounge,Chief Design Engineer,Manchester College,478-235-7826,laser,1984,12.8
54,Rikki,Verbrugge,rverbrugge1i@usnews.com,Male,Geba,Systems Administrator I,State Pedagogical University in Kryvyi Rih,111-278-6472,bankcard,1982,14.4


In [61]:
# But can we slice with respect to the original index?

# This returns an empty DataFrame, because positions only go up to 49
sliced[55:60]

Unnamed: 0,first_name,last_name,email,gender,employer,designation,university,personal_phone,credit_card,birth_year,years_employed


In [63]:
# But this works: .loc slices by label, and the end label (60) is included
sliced.loc[55:60]

Unnamed: 0,first_name,last_name,email,gender,employer,designation,university,personal_phone,credit_card,birth_year,years_employed
55,Ulrica,Widdison,uwiddison1j@va.gov,Female,Edgeblab,Actuary,Universidade Gregorio Semedo,132-476-4488,diners-club-enroute,1990,8.0
56,Gael,Paddock,gpaddock1k@yahoo.com,Male,Trilia,Speech Pathologist,Touro College,565-720-1987,maestro,1960,32.0
57,Kelwin,Goodger,kgoodger1l@github.io,Male,Yakidoo,Product Engineer,University of Belgrade,,,1987,10.4
58,Rustin,Jeffes,rjeffes1m@archive.org,Male,Jabbersphere,Structural Analysis Engineer,Ecole Normale Supérieure de Cachan,211-888-9536,visa,1989,8.8
59,Barris,Kiossel,bkiossel1n@ebay.co.uk,Male,Feedfish,Physical Therapy Assistant,State University of New York College at Geneseo,340-166-6123,jcb,1979,16.8
60,Dulcine,Coughlin,dcoughlin1o@blogtalkradio.com,Female,Meembee,VP Marketing,University of Trieste,159-922-2968,jcb,1961,31.2
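
The difference between the two slices can be reproduced without the CSV. The sketch below uses a made-up single-column frame of 100 rows; only the index labels matter here.

```python
import pandas as pd

df_small = pd.DataFrame({"value": range(100)})
sliced = df_small[50:100]          # 50 rows, still labelled 50..99

by_position = sliced[5:10]         # positions 5..9, i.e. labels 55..59
out_of_range = sliced[55:60]       # positions 55..59 do not exist -> empty
by_label = sliced.loc[55:60]       # labels 55..60, end label included

print(by_position.index.tolist())
print(len(out_of_range))
print(by_label.index.tolist())
```

Positional slices follow the usual Python convention and exclude the end, while label slices with `.loc` include it.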


In [64]:
# Sometimes the original index is useful, especially if you want to modify the slice
# and assign it back to the original DataFrame (rows are matched up by index label)

# But sometimes after a slice you want to reset the index
# (the old labels are then kept in a new column called "index")
sliced=sliced.reset_index()
sliced

Unnamed: 0,index,first_name,last_name,email,gender,employer,designation,university,personal_phone,credit_card,birth_year,years_employed
0,50,Ceciley,Breakey,cbreakey1e@slideshare.net,Female,Camimbo,Systems Administrator IV,Instituto Superior de Relações Internacionais ...,269-104-0079,jcb,1985,12.0
1,51,Walliw,Ley,wley1f@java.com,Female,Photofeed,Human Resources Manager,Massachusetts Institute of Technology,363-227-4335,jcb,1985,12.0
2,52,Tod,McInteer,tmcinteer1g@shareasale.com,Male,Tambee,Software Consultant,,727-221-1362,laser,1980,16.0
3,53,Charmine,Hearon,chearon1h@harvard.edu,Female,Topiclounge,Chief Design Engineer,Manchester College,478-235-7826,laser,1984,12.8
4,54,Rikki,Verbrugge,rverbrugge1i@usnews.com,Male,Geba,Systems Administrator I,State Pedagogical University in Kryvyi Rih,111-278-6472,bankcard,1982,14.4
5,55,Ulrica,Widdison,uwiddison1j@va.gov,Female,Edgeblab,Actuary,Universidade Gregorio Semedo,132-476-4488,diners-club-enroute,1990,8.0
6,56,Gael,Paddock,gpaddock1k@yahoo.com,Male,Trilia,Speech Pathologist,Touro College,565-720-1987,maestro,1960,32.0
7,57,Kelwin,Goodger,kgoodger1l@github.io,Male,Yakidoo,Product Engineer,University of Belgrade,,,1987,10.4
8,58,Rustin,Jeffes,rjeffes1m@archive.org,Male,Jabbersphere,Structural Analysis Engineer,Ecole Normale Supérieure de Cachan,211-888-9536,visa,1989,8.8
9,59,Barris,Kiossel,bkiossel1n@ebay.co.uk,Male,Feedfish,Physical Therapy Assistant,State University of New York College at Geneseo,340-166-6123,jcb,1979,16.8


### .iloc

In [66]:
# This is good ol' numpy style indexing 
# By position on rows
# and By position on columns
df.iloc[5:7,:3]

Unnamed: 0,first_name,last_name,email
5,Yehudit,Schroder,yschroder5@tuttocitta.it
6,Jareb,Jakoviljevic,jjakoviljevic6@dmoz.org


In [70]:
# A value
df.iloc[5,1]

'Schroder'

In [72]:
# A series
df.iloc[:5,1]

0      Kenelin
1      Haddock
2    Restorick
3     Lantuffe
4     Chaudron
Name: last_name, dtype: object

In [71]:
# also a series
df.iloc[5,1:6]

last_name                      Schroder
email          yschroder5@tuttocitta.it
gender                             Male
employer                     Browsetype
designation      Senior Sales Associate
Name: 5, dtype: object

In [16]:
# iloc accepts positions only, so passing a column label raises an error
#df.iloc[:,["first_name"]]
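
To illustrate the commented-out line above, here is a sketch on a made-up two-row frame. The exact exception type raised by `.iloc` for a label is treated as an assumption that may vary between pandas versions, so several types are caught.

```python
import pandas as pd

df_small = pd.DataFrame(
    {"first_name": ["Otha", "Yance"], "last_name": ["Gjerde", "Barnsdale"]}
)

# .iloc is position-only, so a column label is rejected.
try:
    df_small.iloc[:, ["first_name"]]
    labels_accepted = True
except (IndexError, ValueError, TypeError):
    labels_accepted = False

# Convert the label to a position first, or simply use .loc with the label.
pos = df_small.columns.get_loc("last_name")
by_position = df_small.iloc[:, pos]
by_label = df_small.loc[:, "last_name"]

print(labels_accepted, pos)
```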


#### Handle Missing Values

There are two options in dealing with nulls: 

* Get rid of rows or columns with nulls
* Replace nulls with non-null values, a technique known as imputation

In [21]:
# isnull() returns a DataFrame where each cell is either True or False depending on that cell's null status.
df.isnull()

Unnamed: 0,first_name,last_name,email,gender,employer,designation,university,personal_phone,bank,birth_year,years_employed
0,False,False,False,False,False,True,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,True,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
995,False,False,False,False,False,False,False,False,False,False,False
996,False,False,False,False,False,False,False,False,False,False,False
997,False,False,False,False,False,False,False,False,False,False,False
998,False,False,False,False,False,False,False,False,False,False,False


In [23]:
df.isnull().sum()

first_name          0
last_name           0
email               0
gender              0
employer            0
designation        61
university        108
personal_phone    217
bank                0
birth_year          0
years_employed      0
dtype: int64

In [24]:
# Removing Null Values
df.dropna()

Unnamed: 0,first_name,last_name,email,gender,employer,designation,university,personal_phone,bank,birth_year,years_employed
1,Yance,Barnsdale,ybarnsdale1@reference.com,Male,Goldman Sachs,Social Worker,Fachhochschule Ravensburg-Weingarten,492-125-1131,SBI,1960,32.0
2,Lazarus,MacNish,lmacnish2@altervista.org,Male,TCS,Administrative Officer,"California State University, Channel Islands",235-492-4252,ICICI,1970,24.0
4,Gaspard,Gullivent,ggullivent4@wikia.com,Male,Infosys,Clinical Specialist,"Islamic Azad University, Quchan",232-527-4054,ICICI,1998,1.6
5,Karna,Climar,kclimar5@boston.com,Female,Flipkart,Teacher,Patna University,765-164-9307,SBI,1968,25.6
7,Peria,Beverage,pbeverage7@sakura.ne.jp,Female,Flipkart,Analyst Programmer,"Université Abou Bekr Belkaid, Tlemcen",876-185-6742,SBI,1971,23.2
...,...,...,...,...,...,...,...,...,...,...,...
995,Eduino,Alessandretti,ealessandrettirn@topsy.com,Male,Infosys,Structural Engineer,Universidad Adolfo Ibáñez,354-283-0381,ICICI,1978,17.6
996,Vivianna,Rix,vrixro@sun.com,Female,Flipkart,Quality Engineer,Stonehill College,785-322-0735,Yes Bank,1990,8.0
997,Donaugh,Emmert,demmertrp@sohu.com,Male,Google,Recruiting Manager,University of the South Pacific Centre,585-826-1040,HDFC,1962,30.4
998,Reinaldo,O'Scannill,roscannillrq@desdev.cn,Male,Flipkart,VP Quality Control,Bulacan State University,268-203-9762,ICICI,1990,8.0


In [None]:
# How to drop NA values?
# Argument: Axis

In [None]:
# Imputation
# Get a column => fill its NA cells with a value => fillna()
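
The two outlined cells above could look something like this. The three-row frame and the `"Unknown"` fill value are assumptions chosen for the example; what counts as a sensible replacement depends on the column.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    "employer": ["Zomato", "TCS", "Google"],
    "designation": [np.nan, "Engineer", "Analyst"],
    "personal_phone": ["132-421-9138", np.nan, "585-826-1040"],
})

rows_dropped = df_small.dropna()         # axis=0 (default): drop rows containing any null
cols_dropped = df_small.dropna(axis=1)   # axis=1: drop columns containing any null

# Imputation: fill the nulls of one column with a chosen value.
imputed = df_small.copy()
imputed["designation"] = imputed["designation"].fillna("Unknown")

print(rows_dropped.index.tolist())
print(cols_dropped.columns.tolist())
print(imputed["designation"].tolist())
```

Dropping columns discards a lot of data for the sake of a few missing cells, so it is worth checking `isnull().sum()` first to see how many values are actually missing.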

### Boolean indexing

In [None]:
# boolean indexing by column
print(df['birth_year']>=1990)
# multiple conditions
# using isin()
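
The "multiple conditions" and `isin()` items in the cell above are not demonstrated, so here is a minimal sketch on a made-up four-row frame.

```python
import pandas as pd

df_small = pd.DataFrame({
    "employer": ["Zomato", "TCS", "Google", "Infosys"],
    "birth_year": [1970, 1994, 1998, 1985],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
nineties = df_small[(df_small["birth_year"] >= 1990) & (df_small["birth_year"] < 2000)]

# isin() tests membership in a list of allowed values.
selected = df_small[df_small["employer"].isin(["TCS", "Infosys"])]

print(nineties)
print(selected)
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why the `&` and `|` operators are used instead.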

In [25]:
# boolean indexing by column
print(df['birth_year']>=1990)
only_90s_kids=df[df.birth_year>=1990]
print(only_90s_kids)
only_90s_kids=only_90s_kids[only_90s_kids.birth_year<=2000]
print(len(only_90s_kids))
only_90s_kids.head(5)

0      False
1      False
2      False
3      False
4       True
       ...  
995    False
996     True
997    False
998     True
999    False
Name: birth_year, Length: 1000, dtype: bool
    first_name   last_name                        email  gender  \
4      Gaspard   Gullivent        ggullivent4@wikia.com    Male   
11     Hermann       Dewen           hdewenb@hao123.com    Male   
14       Anica  MacIlhargy         amacilhargye@nih.gov  Female   
18       Hermy  Lambertini   hlambertinii@bloglovin.com    Male   
20        Amii   Sacchetti         asacchettik@ucla.edu  Female   
..         ...         ...                          ...     ...   
981     Gillie    Jikovsky         gjikovskyr9@usda.gov  Female   
986      Helge     Gambell      hgambellre@netvibes.com  Female   
989   Carolyne  Reitenbach  creitenbachrh@canalblog.com  Female   
996   Vivianna         Rix               vrixro@sun.com  Female   
998   Reinaldo  O'Scannill       roscannillrq@desdev.cn    Male   

        

Unnamed: 0,first_name,last_name,email,gender,employer,designation,university,personal_phone,bank,birth_year,years_employed
4,Gaspard,Gullivent,ggullivent4@wikia.com,Male,Infosys,Clinical Specialist,"Islamic Azad University, Quchan",232-527-4054,ICICI,1998,1.6
11,Hermann,Dewen,hdewenb@hao123.com,Male,Zomato,VP Quality Control,Universidade de Caxias do Sul,,Yes Bank,1990,8.0
14,Anica,MacIlhargy,amacilhargye@nih.gov,Female,Zomato,Analyst Programmer,Capitol University,578-536-7874,Yes Bank,1997,2.4
18,Hermy,Lambertini,hlambertinii@bloglovin.com,Male,TCS,Environmental Tech,Warnborough University,715-183-6507,SBI,1994,4.8
20,Amii,Sacchetti,asacchettik@ucla.edu,Female,Infosys,Recruiting Manager,Hyupsung University,,ICICI,1997,2.4


## Setting data

In [78]:
# The most common operation : computing a new column based on other columns

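# Take an explicit copy first: only_90s_kids was produced by filtering df, and
# older pandas versions may emit a SettingWithCopyWarning when a column is added to it
only_90s_kids = only_90s_kids.copy()
only_90s_kids["age"] = 2020 - only_90s_kids.birth_year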

# Notice how the index is preserved
only_90s_kids.head(5)

Unnamed: 0,first_name,last_name,email,gender,employer,designation,university,personal_phone,credit_card,birth_year,years_employed,age
0,Merle,Kenelin,mkenelin0@tamu.edu,Male,Aivee,Dental Hygienist,"Concordia University, Irvine",158-563-8593,visa-electron,1996,3.2,24
1,Marigold,Haddock,mhaddock1@illinois.edu,Female,Realpoint,Technical Writer,Universidad Michoacana de San Nicolás de Hidalgo,209-273-8218,jcb,1997,2.4,23
5,Yehudit,Schroder,yschroder5@tuttocitta.it,Male,Browsetype,Senior Sales Associate,Universidad de Cienfuegos,530-514-8363,jcb,1991,7.2,29
35,Karlens,Dubois,kduboisz@bandcamp.com,Male,Shufflebeat,Recruiting Manager,Shorter College,615-785-9997,jcb,1996,3.2,24
40,Riordan,Jacobowits,rjacobowits14@lulu.com,Male,Realbridge,GIS Technical Architect,"California State University, Monterey Bay",568-830-5897,jcb,1991,7.2,29


In [82]:
# You could even have assigned it to the original df

# but let's copy it first ;)
fresh_df=df.copy()

# Rows whose index label appears in only_90s_kids get a value; every other row is set to NaN.
# This is one benefit of having a separate index object
# instead of plain positional numbering as in NumPy
fresh_df["age"] = 2020 - only_90s_kids.birth_year
fresh_df

Unnamed: 0,first_name,last_name,email,gender,employer,designation,university,personal_phone,credit_card,birth_year,years_employed,age
0,Merle,Kenelin,mkenelin0@tamu.edu,Male,Aivee,Dental Hygienist,"Concordia University, Irvine",158-563-8593,visa-electron,1996,3.2,24.0
1,Marigold,Haddock,mhaddock1@illinois.edu,Female,Realpoint,Technical Writer,Universidad Michoacana de San Nicolás de Hidalgo,209-273-8218,jcb,1997,2.4,23.0
2,Nessy,Restorick,nrestorick2@com.com,Female,Npath,Engineer IV,,530-775-4072,visa-electron,1977,18.4,
3,Madella,Lantuffe,mlantuffe3@comcast.net,Female,Thoughtbridge,Analog Circuit Design manager,"California State University, Fresno",,,1982,14.4,
4,Bax,Chaudron,bchaudron4@reddit.com,Male,Shuffletag,VP Product Management,Institute of Management and Business Technology,824-449-7742,jcb,1981,15.2,
...,...,...,...,...,...,...,...,...,...,...,...,...
995,Christye,Boreham,cborehamrn@rambler.ru,Female,Twitterlist,Director of Sales,,469-619-6625,jcb,1975,20.0,
996,Beltran,Elvy,belvyro@forbes.com,Male,Skaboo,Software Consultant,Tabari Institute of Higher Education,124-721-6651,jcb,1977,18.4,
997,Tana,Weatherhogg,tweatherhoggrp@flickr.com,Female,Tekfly,Actuary,Universitas Wijaya Kusuma Surabaya,,,1981,15.2,
998,Mikaela,Raycroft,mraycroftrq@kickstarter.com,Female,Thoughtblab,Developer II,"University of Maine, Augusta",276-706-4994,diners-club-enroute,1994,4.8,26.0
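
The index alignment described above can be seen on a three-row example. The birth years here are made up; the reference year 2020 is the one used in the notebook.

```python
import pandas as pd

people = pd.DataFrame({"birth_year": [1996, 1977, 1994]})
subset = people[people["birth_year"] >= 1990]   # keeps index labels 0 and 2

# Assignment matches rows by index label, not by position.
# Label 1 has no counterpart in subset, so it becomes NaN.
people["age"] = 2020 - subset["birth_year"]

print(people)
```

Because a NaN is present, the new column is stored as floats, which is why the ages in the output above show up as `24.0` rather than `24`.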


In [87]:
# You can of course set values at a particular position
print(fresh_df.iloc[0,-1]) 
fresh_df.iloc[0,-1]=29
print(fresh_df.iloc[0,-1]) 

29.0
29.0


# Data Investigation Example 1

Let's write down a set of questions and then see how we can answer them using pandas.

* Who are the employers?
* What is the employee count of each employer?
* What is the % of Male vs Female employees for each company?
* What is the average age of an employee for each company?
* Which bank is most popular in each company?

In [96]:
df["employer"].drop_duplicates()

0            Zomato
1     Goldman Sachs
2               TCS
4           Infosys
5          Flipkart
31           Google
Name: employer, dtype: object

In [98]:
grps=df.groupby(by="employer")

In [101]:
grps.groups

{'Flipkart': Int64Index([  5,   7,  13,  21,  25,  27,  33,  40,  50,  55,
             ...
             957, 962, 966, 970, 981, 985, 991, 996, 998, 999],
            dtype='int64', length=167),
 'Goldman Sachs': Int64Index([  1,   3,   6,  15,  16,  22,  23,  24,  26,  36,
             ...
             951, 956, 960, 961, 973, 974, 975, 982, 986, 988],
            dtype='int64', length=168),
 'Google': Int64Index([ 31,  34,  43,  60,  62,  64,  68,  69,  70,  75,
             ...
             924, 932, 934, 950, 958, 964, 978, 979, 989, 997],
            dtype='int64', length=186),
 'Infosys': Int64Index([  4,   8,  20,  29,  32,  35,  38,  48,  56,  65,
             ...
             953, 954, 968, 969, 971, 977, 980, 987, 990, 995],
            dtype='int64', length=162),
 'TCS': Int64Index([  2,  18,  19,  30,  39,  42,  45,  51,  53,  54,
             ...
             931, 933, 935, 939, 949, 952, 963, 965, 983, 993],
            dtype='int64', length=168),
 'Zomato': Int64Index([

In [104]:
# Notice that count() counts the non-null values in each column,
# so columns with missing data show smaller numbers than the group size
grps.count()

Unnamed: 0_level_0,first_name,last_name,email,gender,designation,university,personal_phone,bank,birth_year,years_employed
employer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Flipkart,167,167,167,167,158,155,134,167,167,167
Goldman Sachs,168,168,168,168,151,152,131,168,168,168
Google,186,186,186,186,175,163,146,186,186,186
Infosys,162,162,162,162,155,144,120,162,162,162
TCS,168,168,168,168,163,149,139,168,168,168
Zomato,149,149,149,149,137,129,113,149,149,149


In [19]:
# So how to get the number of rows in each group?

# One way:
fresh=df.copy()
fresh["id"]=np.arange(0,len(fresh))
display(fresh.groupby("employer").count()["id"])

# Another simpler way :)
display(df.groupby("employer").size())

# Remember a little creativity goes a long way

employer
Flipkart         167
Goldman Sachs    168
Google           186
Infosys          162
TCS              168
Zomato           149
Name: id, dtype: int64

In [123]:
grps=df.groupby(by=["employer","gender"])

# Grouping by two columns gives a Series with a hierarchical index (a MultiIndex)
display(grps.size())

employer       gender
Flipkart       Female    86
               Male      81
Goldman Sachs  Female    85
               Male      83
Google         Female    91
               Male      95
Infosys        Female    82
               Male      80
TCS            Female    92
               Male      76
Zomato         Female    72
               Male      77
dtype: int64

In [124]:
counts=pd.DataFrame(grps.size())
print(counts.to_records())
counts=pd.DataFrame(counts.to_records())
display(counts)

[('Flipkart', 'Female', 86) ('Flipkart', 'Male', 81)
 ('Goldman Sachs', 'Female', 85) ('Goldman Sachs', 'Male', 83)
 ('Google', 'Female', 91) ('Google', 'Male', 95) ('Infosys', 'Female', 82)
 ('Infosys', 'Male', 80) ('TCS', 'Female', 92) ('TCS', 'Male', 76)
 ('Zomato', 'Female', 72) ('Zomato', 'Male', 77)]


Unnamed: 0,employer,gender,0
0,Flipkart,Female,86
1,Flipkart,Male,81
2,Goldman Sachs,Female,85
3,Goldman Sachs,Male,83
4,Google,Female,91
5,Google,Male,95
6,Infosys,Female,82
7,Infosys,Male,80
8,TCS,Female,92
9,TCS,Male,76
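
The counts above still need to be turned into percentages to answer the Male vs Female question. One possible way is sketched below on a made-up six-row frame; the column name `percent` is an arbitrary choice.

```python
import pandas as pd

df_small = pd.DataFrame({
    "employer": ["TCS", "TCS", "TCS", "TCS", "Google", "Google"],
    "gender": ["Female", "Female", "Female", "Male", "Male", "Female"],
})

# Count the rows for each (employer, gender) pair.
counts = df_small.groupby(["employer", "gender"]).size()

# Total per employer, repeated for each of its rows so that the division lines up.
totals = counts.groupby(level="employer").transform("sum")

# reset_index turns the hierarchical index back into ordinary columns.
percent = (counts / totals * 100).reset_index(name="percent")
print(percent)
```

`reset_index(name=...)` also offers a shorter route than `to_records()` for flattening the grouped result, and it lets the value column be named directly.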


In [126]:
with_age=df.copy()
with_age["age"]=2020 - with_age.birth_year
with_age[["employer","age"]].groupby("employer").mean()

Unnamed: 0_level_0,age
employer,Unnamed: 1_level_1
Flipkart,41.616766
Goldman Sachs,41.047619
Google,43.0
Infosys,42.382716
TCS,39.005952
Zomato,42.651007


In [144]:
temp=df[["employer","bank"]].groupby(["employer","bank"]).size()
temp=pd.DataFrame(pd.DataFrame(temp).to_records())
temp=temp.rename(columns={"0":"count"})
temp=temp.sort_values('count',ascending=False)
display(temp)

Unnamed: 0,employer,bank,count
8,Google,HDFC,55
10,Google,SBI,55
14,Infosys,SBI,51
2,Flipkart,SBI,47
0,Flipkart,HDFC,46
17,TCS,ICICI,45
18,TCS,SBI,45
4,Goldman Sachs,HDFC,44
22,Zomato,SBI,44
16,TCS,HDFC,43


In [147]:
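# temp is sorted by count, so keeping the first row per employer gives its most popular bank
# (when two banks tie, as for Google above, only one of them is kept)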
display(temp.drop_duplicates(subset=["employer"],keep="first"))

Unnamed: 0,employer,bank,count
8,Google,HDFC,55
14,Infosys,SBI,51
2,Flipkart,SBI,47
17,TCS,ICICI,45
4,Goldman Sachs,HDFC,44
22,Zomato,SBI,44
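
Since `drop_duplicates` keeps a single row per employer, tied banks are silently dropped. If ties matter, one alternative is to compare each count with its employer's maximum. The sketch below uses a small hand-written table whose four rows copy counts shown in the output above; it is not the full result.

```python
import pandas as pd

counts = pd.DataFrame({
    "employer": ["Google", "Google", "TCS", "TCS"],
    "bank": ["HDFC", "SBI", "ICICI", "HDFC"],
    "count": [55, 55, 45, 43],
})

# Same idea as the cell above: sort, then keep the first row per employer.
top_one = (
    counts.sort_values("count", ascending=False)
          .drop_duplicates(subset=["employer"], keep="first")
)

# Alternative that keeps ties: select rows whose count equals the employer's maximum.
is_max = counts["count"] == counts.groupby("employer")["count"].transform("max")
top_with_ties = counts[is_max]

print(top_one)
print(top_with_ties)
```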
