# Pandas
## What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

## Why Use Pandas?
Pandas lets us analyze large data sets and draw conclusions from them using statistical methods.

Pandas can also clean messy data sets, making them readable and relevant.

Relevant data is very important in data science.

In [1]:
#!pip install pandas

In [7]:
import pandas as pd
pd.__version__

'1.4.1'

### DataFrames

Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a single column; a DataFrame is the whole table.

In [15]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford","Porsche"],
  'establish': [1922, 1915, 1903,1931]
}

df = pd.DataFrame(mydataset)

print(df)

      cars  establish
0      BMW       1922
1    Volvo       1915
2     Ford       1903
3  Porsche       1931
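
To make the column/table relationship concrete, the sketch below rebuilds the same data and selects one column from it. The variable names `cars_df` and `establish` are chosen here for illustration only.

```python
import pandas as pd

# Same illustrative data as the example above.
cars_df = pd.DataFrame({
    "cars": ["BMW", "Volvo", "Ford", "Porsche"],
    "establish": [1922, 1915, 1903, 1931],
})

# Selecting one column by name returns a Series.
establish = cars_df["establish"]

print(type(cars_df).__name__)    # DataFrame
print(type(establish).__name__)  # Series
print(cars_df.shape)             # (4, 2): four rows, two columns
print(establish.shape)           # (4,): one dimension only
```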


### What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [9]:
import pandas as pd
a = [1, 7, 2]
df = pd.Series(a)
print(df)

0    1
1    7
2    2
dtype: int64


If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

In [10]:
print(df[0])

1


In [11]:
import pandas as pd
a = [1, 7, 2]
df = pd.Series(a,index=['a','b','c'])
print(df)

a    1
b    7
c    2
dtype: int64
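
Once a Series has custom labels, a value can be looked up by its label. The short sketch below uses the same values as above; the name `s` is chosen here for illustration, and `loc`/`iloc` are shown as the explicit label-based and position-based alternatives.

```python
import pandas as pd

s = pd.Series([1, 7, 2], index=["a", "b", "c"])

print(s["b"])         # 7, looked up by label
print(s.loc["b"])     # 7, explicit label-based access
print(s.iloc[1])      # 7, explicit position-based access
print(list(s.index))  # ['a', 'b', 'c']
```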


### Key/Value Objects as Series

In [12]:
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

df = pd.Series(calories)

print(df)

day1    420
day2    380
day3    390
dtype: int64
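
When a Series is built from a dictionary, the keys become the labels. As a small sketch, the `index` argument can also be used to pick only some of the keys; the names `subset` and `with_missing`, and the key `"day4"`, are made up for this illustration.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# Only the listed keys are included, in the order given.
subset = pd.Series(calories, index=["day1", "day2"])
print(subset)

# A label that is not a key in the dictionary gets a missing value (NaN).
with_missing = pd.Series(calories, index=["day1", "day4"])
print(with_missing)
```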


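### Locate Row

As the cars example in the DataFrames section shows, a DataFrame is like a table with rows and columns. The cells below use that DataFrame, so re-run the cell that builds it first, because df was reassigned in the Series examples.

Pandas uses the loc attribute to return one or more rows by index label, and the iloc attribute to return rows by integer position. With the default index (0, 1, 2, ...), both return the same row.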

In [16]:
df.iloc[0]

cars          BMW
establish    1922
Name: 0, dtype: object

In [18]:
df.loc[[0, 1]]

    cars  establish
0    BMW       1922
1  Volvo       1915


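*Note:* When a list of labels is passed inside [], as in df.loc[[0, 1]], the result is a Pandas DataFrame. A single label, as in df.loc[0], returns a Series.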
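
The difference between a single label and a list of labels can be checked directly. This is a minimal sketch on the same cars data; the variable names are chosen here for illustration.

```python
import pandas as pd

cars_df = pd.DataFrame({
    "cars": ["BMW", "Volvo", "Ford", "Porsche"],
    "establish": [1922, 1915, 1903, 1931],
})

row = cars_df.loc[0]           # single label: one row as a Series
rows = cars_df.loc[[0, 1]]     # list of labels: a DataFrame
one_row_df = cars_df.loc[[0]]  # a one-element list still gives a DataFrame

print(type(row).__name__)         # Series
print(type(rows).__name__)        # DataFrame
print(type(one_row_df).__name__)  # DataFrame
```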

In [26]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
df

      calories  duration
day1       420        50
day2       380        40
day3       390        45


### Locate Named Indexes

Use the named index in the loc attribute to return the specified row(s).

In [25]:
df.loc["day2"]

calories    380
duration     40
Name: day2, dtype: int64
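
With a named index, loc looks rows up by label while iloc still counts positions from 0. The sketch below compares the two on the same data; `workouts` and the other variable names are chosen here for illustration.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
workouts = pd.DataFrame(data, index=["day1", "day2", "day3"])

by_label = workouts.loc["day2"]  # row with the label "day2"
by_position = workouts.iloc[1]   # second row, counting from 0
print(by_label.equals(by_position))  # True

# loc also accepts a column label to reach a single cell.
cell = workouts.loc["day2", "calories"]
print(cell)  # 380

# A list of row labels with one column label gives a Series.
several = workouts.loc[["day1", "day3"], "duration"]
print(several)
```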

### Load Files Into a DataFrame

In [43]:
df = pd.read_csv('howlongwelive.csv')
df.head(5)

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
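
Because howlongwelive.csv is not included here, the sketch below reads a few of the rows shown above from an in-memory string instead of a file. It assumes, as the later cells suggest, that some column headers carry stray whitespace (for example 'Life expectancy ' with a trailing space), and shows one way to strip it. The four-column layout is a simplification for illustration, not the full file.

```python
import io

import pandas as pd

# A small excerpt of the data, with the trailing space kept in the last header.
csv_text = (
    "Country,Year,Status,Life expectancy \n"
    "Afghanistan,2015,Developing,65.0\n"
    "Afghanistan,2014,Developing,59.9\n"
    "Afghanistan,2013,Developing,59.9\n"
)

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
raw_columns = sample.columns.tolist()
print(raw_columns)  # the last name still ends with a space

# Strip leading and trailing whitespace from every column name.
sample.columns = sample.columns.str.strip()
print(sample.columns.tolist())
print(sample.head(2))
```

Stripping the headers is optional; the rest of this notebook keeps the original names, which is why the trailing space appears in df['Life expectancy '].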


### Info About the Data

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10   BMI                             2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio               

In [45]:
df.isnull().sum()

Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
 BMI                                34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
 HIV/AIDS                            0
GDP                                448
Population                         652
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64

### Data Cleaning

Data cleaning means fixing bad data in your data set.

Bad data could be:

- Empty cells
- Data in wrong format
- Wrong data
- Duplicates
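
Each kind of bad data in the list above can be detected with a short check. The sketch below uses a small, invented table (the rows, the column names, and the 120-minute cutoff are all assumptions made for illustration) and is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical rows, invented to show each kind of problem.
raw = pd.DataFrame({
    "duration": [60, 60, 45, 450, 60],
    "date": ["2020-12-01", "2020-12-02", None, "2020-12-04", "2020-12-02"],
    "calories": ["409", "479", "282", "n/a", "479"],
})

# Empty cells: count missing values per column.
missing_per_column = raw.isnull().sum()
print(missing_per_column)

# Data in wrong format: convert text to numbers; unparseable text becomes NaN.
calories_numeric = pd.to_numeric(raw["calories"], errors="coerce")
print(calories_numeric)

# Wrong data: flag values outside a plausible range (cutoff is an assumption).
too_long = raw["duration"] > 120
print(raw[too_long])

# Duplicates: mark rows that repeat an earlier row exactly.
duplicate_rows = raw.duplicated()
print(duplicate_rows)
```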

In [34]:
df.shape

(2938, 22)

#### Load the data set and remove rows with empty cells

In [60]:
df = pd.read_csv('howlongwelive.csv')
new_df = df.dropna()
new_df.shape

(1649, 22)

In [61]:
df.dropna(inplace = True)
df.shape

(1649, 22)
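
The two cells above differ in whether the original DataFrame is changed. The sketch below shows that difference on a tiny, invented table (the values and the name `frame` are assumptions for illustration), and also shows the `subset` argument for dropping rows based on selected columns only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Life expectancy": [65.0, np.nan, 59.9],
    "GDP": [584.26, 612.70, np.nan],
})

# dropna() returns a new DataFrame and leaves the original untouched.
cleaned = frame.dropna()
shape_before = frame.shape
print(shape_before)   # (3, 2)
print(cleaned.shape)  # (1, 2)

# subset limits the check to the listed columns.
partly = frame.dropna(subset=["Life expectancy"])
print(partly.shape)   # (2, 2)

# inplace=True changes the original DataFrame and returns None.
frame.dropna(inplace=True)
print(frame.shape)    # (1, 2)
```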

In [68]:
df.describe()

Unnamed: 0,Year,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
count,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0
mean,2007.840509,69.302304,168.215282,32.553062,4.533196,698.973558,79.217708,2224.494239,38.128623,44.220133,83.564585,5.955925,84.155246,1.983869,5566.031887,14653630.0,4.850637,4.907762,0.631551,12.119891
std,4.087711,8.796834,125.310417,120.84719,4.029189,1759.229336,25.604664,10085.802019,19.754249,162.897999,22.450557,2.299385,21.579193,6.03236,11475.900117,70460390.0,4.599228,4.653757,0.183089,2.795388
min,2000.0,44.0,1.0,0.0,0.01,0.0,2.0,0.0,2.0,0.0,3.0,0.74,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,4.2
25%,2005.0,64.4,77.0,1.0,0.81,37.438577,74.0,0.0,19.5,1.0,81.0,4.41,82.0,0.1,462.14965,191897.0,1.6,1.7,0.509,10.3
50%,2008.0,71.7,148.0,3.0,3.79,145.102253,89.0,15.0,43.7,4.0,93.0,5.84,92.0,0.1,1592.572182,1419631.0,3.0,3.2,0.673,12.3
75%,2011.0,75.0,227.0,22.0,7.34,509.389994,96.0,373.0,55.8,29.0,97.0,7.47,97.0,0.7,4718.51291,7658972.0,7.1,7.1,0.751,14.0
max,2015.0,89.0,723.0,1600.0,17.87,18961.3486,99.0,131441.0,77.1,2100.0,99.0,14.39,99.0,50.6,119172.7418,1293859000.0,27.2,28.2,0.936,20.7


In [67]:
df.corr()
# Note: in pandas 1.x, corr() skips non-numeric columns; pandas 2.0 and later need df.corr(numeric_only=True).

Unnamed: 0,Year,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
Year,1.0,0.050771,-0.037092,0.008029,-0.113365,0.069553,0.114897,-0.053822,0.005739,0.010479,-0.016699,0.059493,0.029641,-0.123405,0.096421,0.012567,0.019757,0.014122,0.122892,0.088732
Life expectancy,0.050771,1.0,-0.702523,-0.169074,0.402718,0.409631,0.199935,-0.068881,0.542042,-0.192265,0.327294,0.174718,0.341331,-0.592236,0.441322,-0.022305,-0.457838,-0.457508,0.721083,0.72763
Adult Mortality,-0.037092,-0.702523,1.0,0.04245,-0.175535,-0.23761,-0.105225,-0.003967,-0.351542,0.060365,-0.199853,-0.085227,-0.191429,0.550691,-0.255035,-0.015012,0.27223,0.286723,-0.442203,-0.421171
infant deaths,0.008029,-0.169074,0.04245,1.0,-0.106217,-0.090765,-0.231769,0.53268,-0.234425,0.996906,-0.156929,-0.146951,-0.161871,0.007712,-0.098092,0.671758,0.463415,0.461908,-0.134754,-0.214372
Alcohol,-0.113365,0.402718,-0.175535,-0.106217,1.0,0.417047,0.109889,-0.05011,0.353396,-0.101082,0.240315,0.214885,0.242951,-0.027113,0.443433,-0.02888,-0.403755,-0.386208,0.561074,0.616975
percentage expenditure,0.069553,0.409631,-0.23761,-0.090765,0.417047,1.0,0.01676,-0.063071,0.242738,-0.092158,0.128626,0.183872,0.134813,-0.095085,0.959299,-0.016792,-0.255035,-0.255635,0.40217,0.422088
Hepatitis B,0.114897,0.199935,-0.105225,-0.231769,0.109889,0.01676,1.0,-0.1248,0.143302,-0.240766,0.463331,0.113327,0.58899,-0.094802,0.04185,-0.129723,-0.129406,-0.133251,0.184921,0.215182
Measles,-0.053822,-0.068881,-0.003967,0.53268,-0.05011,-0.063071,-0.1248,1.0,-0.153245,0.517506,-0.05785,-0.113583,-0.058606,-0.003522,-0.064768,0.321946,0.180642,0.174946,-0.058277,-0.11566
BMI,0.005739,0.542042,-0.351542,-0.234425,0.353396,0.242738,0.143302,-0.153245,1.0,-0.242137,0.186268,0.189469,0.176295,-0.210897,0.266114,-0.081416,-0.547018,-0.554094,0.510505,0.554844
under-five deaths,0.010479,-0.192265,0.060365,0.996906,-0.101082,-0.092158,-0.240766,0.517506,-0.242137,1.0,-0.171164,-0.145803,-0.178448,0.019476,-0.100331,0.65868,0.464785,0.462289,-0.148097,-0.226013
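
How corr() treats text columns depends on the pandas version, so one version-independent option is to select the numeric columns first. The sketch below does that on a small, invented table; the values and the name `mixed` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows: one text column and two numeric columns.
mixed = pd.DataFrame({
    "Status": ["Developing", "Developed", "Developing", "Developed"],
    "Schooling": [10.0, 15.0, 11.0, 16.0],
    "Life expectancy": [60.0, 80.0, 62.0, 82.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = mixed.select_dtypes(include="number")
corr_matrix = numeric.corr()
print(corr_matrix)
```

Each value lies between -1 and 1; values close to 1 or -1 indicate a strong linear relationship, and the diagonal is always 1 because every column correlates perfectly with itself.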


### Replace Empty Values

### Replace Using Mean, Median, or Mode

Pandas provides the mean(), median(), and mode() methods to calculate the respective values for a specified column. The result can then be used to replace the empty cells in that column:

In [64]:
mean_life = df['Life expectancy '].mean()
median_life = df['Life expectancy '].median()
mode_life = df['Life expectancy '].mode()[0]
mean_life, median_life, mode_life

(69.3023044269254, 71.7, 73.0)

In [52]:
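# Assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in newer pandas versions.
df['Life expectancy '] = df['Life expectancy '].fillna(mean_life)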

In [54]:
df['Life expectancy '].isnull().sum()

0
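
The same steps can be tried on a small, invented column where the effect is easy to check by hand. The values and names below are assumptions for illustration; the median is used as the replacement here, but the mean or mode works the same way.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({"Life expectancy": [65.0, 59.9, np.nan, 59.9, 71.0]})
col = ages["Life expectancy"]

# Missing values are skipped when computing these statistics.
mean_value = col.mean()
median_value = col.median()
mode_value = col.mode()[0]  # mode() returns a Series; take the first entry
print(mean_value, median_value, mode_value)

# fillna returns a new Series with the empty cell replaced.
filled = col.fillna(median_value)
print(filled)
print(filled.isnull().sum())  # 0
```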