**Pandas** <br>
Reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

In [125]:
import pandas as pd

In [126]:
df = pd.read_csv("./data_files/gapminderDataFiveYear.csv")

In [129]:
type(df)

pandas.core.frame.DataFrame

In [132]:
df.shape  # (rows, columns)

(1704, 6)

In [134]:
df.shape()  # shape is an attribute, not a method, so calling it fails

TypeError: 'tuple' object is not callable
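
Because shape is a plain tuple, it can be indexed or unpacked directly. The following is a minimal sketch; the `toy` frame is made-up data standing in for the Gapminder file, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data, not the Gapminder file
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [1952, 1957, 1952],
    "lifeExp": [28.8, 30.3, 55.2],
})

n_rows, n_cols = toy.shape   # shape is a tuple: (rows, columns)
print(n_rows, n_cols)
print(len(toy))              # len() of a DataFrame is its row count
```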

In [136]:
#check col names
df.columns

Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')

In [137]:
# what is the data type of each column?

df.dtypes

country       object
year           int64
pop          float64
continent     object
lifeExp      float64
gdpPercap    float64
dtype: object

In [140]:
#get more info on data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
year         1704 non-null int64
pop          1704 non-null float64
continent    1704 non-null object
lifeExp      1704 non-null float64
gdpPercap    1704 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB


**Looking at Columns, Rows, and Cells**

**Subsetting columns**: There are two ways: by name and by index position.<br>


In [141]:
#By names
#show first 5 rows
df.head()

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332,820.85303
2,Afghanistan,1962,10267083.0,Asia,31.997,853.10071
3,Afghanistan,1967,11537966.0,Asia,34.02,836.197138
4,Afghanistan,1972,13079460.0,Asia,36.088,739.981106


In [142]:
df.tail()

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
1699,Zimbabwe,1987,9216418.0,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340.0,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948.0,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563.0,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143.0,Africa,43.487,469.709298


In [144]:
subset= df[['country','continent','year']]
subset.head()

Unnamed: 0,country,continent,year
0,Afghanistan,Asia,1952
1,Afghanistan,Asia,1957
2,Afghanistan,Asia,1962
3,Afghanistan,Asia,1967
4,Afghanistan,Asia,1972


In [155]:
# by index position
# first look up the column name at that position, then select the column
df.columns[1] # column name

'year'

In [160]:
# then select the column using that name
df [ df.columns[1] ] 

0       1952
1       1957
2       1962
3       1967
4       1972
5       1977
6       1982
7       1987
8       1992
9       1997
10      2002
11      2007
12      1952
13      1957
14      1962
15      1967
16      1972
17      1977
18      1982
19      1987
20      1992
21      1997
22      2002
23      2007
24      1952
25      1957
26      1962
27      1967
28      1972
29      1977
        ... 
1674    1982
1675    1987
1676    1992
1677    1997
1678    2002
1679    2007
1680    1952
1681    1957
1682    1962
1683    1967
1684    1972
1685    1977
1686    1982
1687    1987
1688    1992
1689    1997
1690    2002
1691    2007
1692    1952
1693    1957
1694    1962
1695    1967
1696    1972
1697    1977
1698    1982
1699    1987
1700    1992
1701    1997
1702    2002
1703    2007
Name: year, Length: 1704, dtype: int64
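
The two-step lookup above (position to name, then name to column) can also be done in one step with iloc. A small sketch on made-up data (the `toy` frame is hypothetical), suggesting that both routes should return the same column:

```python
import pandas as pd

# Hypothetical toy data, not the Gapminder file
toy = pd.DataFrame({
    "country": ["A", "B"],
    "year": [1952, 1957],
    "pop": [1.0e6, 2.0e6],
})

by_name = toy[toy.columns[1]]   # look up the label at position 1, then select by label
by_position = toy.iloc[:, 1]    # select the column at position 1 directly

print(by_name.equals(by_position))
```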

**Subsetting Rows**<br>
1. loc: subset based on index label (row name)<br>
2. iloc: subset based on row position (row number)<br>
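
With the default index, labels and positions are the same numbers, which hides the difference between the two. The sketch below uses a made-up frame with string labels (the `toy` data and its labels are assumptions for illustration) so the distinction is visible.

```python
import pandas as pd

# Hypothetical data with string index labels
toy = pd.DataFrame(
    {"lifeExp": [28.8, 43.5, 51.3]},
    index=["afg", "bgd", "mng"],
)

by_label = toy.loc["bgd"]     # the row whose index label is "bgd"
by_position = toy.iloc[1]     # the second row, whatever its label is

print(by_label.equals(by_position))
```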

In [161]:
# subset rows by index label:
df.head()

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332,820.85303
2,Afghanistan,1962,10267083.0,Asia,31.997,853.10071
3,Afghanistan,1967,11537966.0,Asia,34.02,836.197138
4,Afghanistan,1972,13079460.0,Asia,36.088,739.981106


In [162]:
# get first row
df.loc[0]

country      Afghanistan
year                1952
pop          8.42533e+06
continent           Asia
lifeExp           28.801
gdpPercap        779.445
Name: 0, dtype: object

In [163]:
df.loc[99]

country       Bangladesh
year                1967
pop          6.28219e+07
continent           Asia
lifeExp           43.453
gdpPercap        721.186
Name: 99, dtype: object

In [165]:
#get the last row
df.loc[-1]  # raises an error: loc looks for a row labeled -1, which does not exist

KeyError: 'the label [-1] is not in the [index]'

In [167]:
# one solution: compute the label of the last row
number_of_rows = df.shape[0]

last_row_index = number_of_rows - 1

print(last_row_index)

1703


In [169]:
#last row
df.loc[last_row_index]

country         Zimbabwe
year                2007
pop          1.23111e+07
continent         Africa
lifeExp           43.487
gdpPercap        469.709
Name: 1703, dtype: object

In [171]:
# another way (returns a one-row DataFrame rather than a Series)
df.tail(n=1)  # use df.head(n=1) for the first row

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
1703,Zimbabwe,2007,12311143.0,Africa,43.487,469.709298
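
Computing the last label as the number of rows minus one relies on the default 0..n-1 index. A hedged sketch on made-up data (the `toy` and `filtered` names are hypothetical) of a variant that reads the last label from the index itself, which should also work after rows have been filtered out:

```python
import pandas as pd

# Hypothetical toy data, not the Gapminder file
toy = pd.DataFrame({"year": [1952, 1957, 1962, 1967]})
filtered = toy[toy["year"] > 1957]        # keeps the rows labeled 2 and 3

last_label = filtered.index[-1]           # label of the final row
print(filtered.loc[last_label])

# Here shape[0] - 1 is 1, which is no longer a label in the index
print((filtered.shape[0] - 1) in filtered.index)
```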


**Subsetting Multiple Rows**

In [172]:
# select the 1st, 100th, and 1000th rows
df.loc[ [0,99,999]]

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314
99,Bangladesh,1967,62821884.0,Asia,43.453,721.186086
999,Mongolia,1967,1149500.0,Asia,51.253,1226.04113


**Subset Rows by Row Number: iloc** <br>
iloc works like loc, but it selects rows by integer position rather than by index label.

In [173]:
df.iloc[0]

country      Afghanistan
year                1952
pop          8.42533e+06
continent           Asia
lifeExp           28.801
gdpPercap        779.445
Name: 0, dtype: object

In [174]:
#get the last row
print(df.iloc[-1])

country         Zimbabwe
year                2007
pop          1.23111e+07
continent         Africa
lifeExp           43.487
gdpPercap        469.709
Name: 1703, dtype: object


In [175]:
## get the first, 100th, and 1000th rows

print(df.iloc[[0, 99, 999]])

         country  year         pop continent  lifeExp    gdpPercap
0    Afghanistan  1952   8425333.0      Asia   28.801   779.445314
99    Bangladesh  1967  62821884.0      Asia   43.453   721.186086
999     Mongolia  1967   1149500.0      Asia   51.253  1226.041130


**General syntax** to obtain columns, rows or both:<br>
df.loc[[rows], [columns]] or df.iloc[[rows], [columns]] <br>
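
A minimal sketch of this syntax on made-up data (the `toy` frame is hypothetical). The same cells are selected once by label and once by position:

```python
import pandas as pd

# Hypothetical toy data, not the Gapminder file
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [1952, 1957, 1962],
    "pop": [1.0, 2.0, 3.0],
})

# loc: row labels and column labels
by_label = toy.loc[[0, 2], ["country", "pop"]]

# iloc: row positions and column positions
by_position = toy.iloc[[0, 2], [0, 2]]

print(by_label.equals(by_position))
```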

**Subsetting columns**

In [180]:
#select all rows of year and pop columns
# loc allows us to use column label but not column number
subset = df.loc[:, ['year', 'pop']]
subset.head()

Unnamed: 0,year,pop
0,1952,8425333.0
1,1957,9240934.0
2,1962,10267083.0
3,1967,11537966.0
4,1972,13079460.0


In [181]:
# subset columns with iloc

# iloc will allow us to use integers

# -1 will select the last column

subset = df.iloc[:, [2, 4, -1]]
subset.head()

Unnamed: 0,pop,lifeExp,gdpPercap
0,8425333.0,28.801,779.445314
1,9240934.0,30.332,820.85303
2,10267083.0,31.997,853.10071
3,11537966.0,34.02,836.197138
4,13079460.0,36.088,739.981106


In [183]:
subset = df.loc[:, [2, 4, -1]]  # error: loc expects column labels, not integer positions

KeyError: 'None of [[2, 4, -1]] are in the [columns]'

In [185]:
subset = df.iloc[:, ['year', 'pop']]  # error: iloc expects integer positions, not column labels

TypeError: cannot perform reduce with flexible type
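
The exact exception raised by iloc here appears to depend on the pandas version, so the sketch below catches several types rather than relying on one message. The `toy` frame and the flag names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the Gapminder file
toy = pd.DataFrame({"year": [1952], "pop": [1.0], "lifeExp": [28.8]})

try:
    toy.loc[:, [0, 2]]             # integers are not column labels here
    loc_failed = False
except KeyError:
    loc_failed = True

try:
    toy.iloc[:, ["year", "pop"]]   # strings are not integer positions
    iloc_failed = False
except (TypeError, IndexError, ValueError):
    iloc_failed = True

print(loc_failed, iloc_failed)
```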

**Subsetting Columns by Range**: using range

In [187]:
small_range = list(range(5))

subset = df.iloc[:, small_range]
subset.head()

Unnamed: 0,country,year,pop,continent,lifeExp
0,Afghanistan,1952,8425333.0,Asia,28.801
1,Afghanistan,1957,9240934.0,Asia,30.332
2,Afghanistan,1962,10267083.0,Asia,31.997
3,Afghanistan,1967,11537966.0,Asia,34.02
4,Afghanistan,1972,13079460.0,Asia,36.088


**Questions**: What happens when you specify a range that’s beyond the number of columns you have?<br>
What happens if you use the slicing method with two colons, but leave a value out? For example, what is the result in each of the following cases?

■ df.iloc[:, 0:6:] <br>

■ df.iloc[:, 0::2] <br>

■ df.iloc[:, :6:2] <br>

■ df.iloc[:, ::2] <br>

■ df.iloc[:, ::] <br>
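
One way to explore these questions is to try them on a small frame. The sketch below assumes a made-up one-row frame with six columns named a to f (hypothetical, chosen only to match the width of the Gapminder data).

```python
import pandas as pd

# Hypothetical frame with six columns
toy = pd.DataFrame([[0, 1, 2, 3, 4, 5]], columns=list("abcdef"))

all_a = list(toy.iloc[:, 0:6:].columns)    # step omitted: every column from 0 up to 6
step_a = list(toy.iloc[:, 0::2].columns)   # stop omitted: every second column to the end
step_b = list(toy.iloc[:, :6:2].columns)   # start omitted: every second column from 0
step_c = list(toy.iloc[:, ::2].columns)    # start and stop omitted
all_b = list(toy.iloc[:, ::].columns)      # everything omitted: all columns

# A slice that runs past the last column is clipped silently
clipped = list(toy.iloc[:, 0:10].columns)

# A list of positions that runs past the last column raises IndexError
try:
    toy.iloc[:, list(range(10))]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(all_a, step_a, clipped, out_of_bounds)
```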

**Subsetting Rows and Columns**

In [189]:
# Using loc
df.loc[42, 'country']

'Angola'

In [190]:
#Using iloc
df.iloc[42, 0 ]

'Angola'

In [191]:
print(df.loc[42, 0])  # error: loc expects a column label, and 0 is not one of the column names

TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0] of <class 'int'>

In [192]:
# get the 1st, 100th, and 1000th rows

# from the 1st, 4th, and 6th columns

# with this file's column order, those columns are

# country, continent, and gdpPercap (lifeExp is at position 4)
print(df.iloc[[0, 99, 999], [0, 3, 5]])

         country continent    gdpPercap
0    Afghanistan      Asia   779.445314
99    Bangladesh      Asia   721.186086
999     Mongolia      Asia  1226.041130


In [193]:
# if we use the column names directly,

# it makes the code a bit easier to read

# note now we have to use loc, instead of iloc

print(df.loc[[0, 99, 999], ['country', 'lifeExp', 'gdpPercap']])

         country  lifeExp    gdpPercap
0    Afghanistan   28.801   779.445314
99    Bangladesh   43.453   721.186086
999     Mongolia   51.253  1226.041130
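
When a selection needs both a column name and a column number, one can be converted into the other. A hedged sketch on made-up data (the `toy` frame is hypothetical), using `Index.get_loc` to turn a label into a position and indexing `columns` to turn positions into labels:

```python
import pandas as pd

# Hypothetical toy data, not the Gapminder file
toy = pd.DataFrame({
    "country": ["A", "B"],
    "continent": ["X", "Y"],
    "lifeExp": [28.8, 43.5],
})

# Label -> position, so a column name can be used with iloc
life_pos = toy.columns.get_loc("lifeExp")
by_position = toy.iloc[[0, 1], [0, life_pos]]

# Position -> label, so column numbers can be used with loc
by_label = toy.loc[[0, 1], toy.columns[[0, 2]]]

print(by_position.equals(by_label))
```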


**GROUPED AND AGGREGATED CALCULATIONS**

 For each year in our data, what was the average life expectancy?

In [197]:
# split the data by year, then take the mean of lifeExp within each group
print(df.groupby('year')['lifeExp'].mean() )

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
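
The groupby call can be read as split, apply, combine. The sketch below spells those steps out by hand on made-up data (the `toy` frame and the `manual` dict are hypothetical) and compares the result with groupby.

```python
import pandas as pd

# Hypothetical toy data, not the Gapminder file
toy = pd.DataFrame({
    "year": [1952, 1952, 1957, 1957],
    "lifeExp": [30.0, 50.0, 40.0, 60.0],
})

# Split: one sub-frame per year. Apply: mean of lifeExp. Combine: collect into a dict.
manual = {}
for year in toy["year"].unique():
    part = toy[toy["year"] == year]
    manual[int(year)] = float(part["lifeExp"].mean())

grouped = toy.groupby("year")["lifeExp"].mean()

print(manual)
print(grouped)
```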


In [199]:
# group by more than one variable and aggregate more than one column

multi_group_var = df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].mean()
multi_group_var.tail()

year,continent,lifeExp,gdpPercap
2007,Africa,54.806038,3089.032605
2007,Americas,73.60812,11003.031625
2007,Asia,70.728485,12473.02687
2007,Europe,77.6486,25054.481636
2007,Oceania,80.7195,29810.188275
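
Grouping by two variables yields a result indexed by both (a MultiIndex). If ordinary columns are easier to work with, `reset_index` can flatten it. A minimal sketch on made-up data (the `toy` values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not the Gapminder file
toy = pd.DataFrame({
    "year": [2007, 2007, 2007, 2007],
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "lifeExp": [70.0, 72.0, 78.0, 80.0],
    "gdpPercap": [1000.0, 3000.0, 20000.0, 30000.0],
})

grouped = toy.groupby(["year", "continent"])[["lifeExp", "gdpPercap"]].mean()
flat = grouped.reset_index()   # move year and continent back into ordinary columns

print(grouped)
print(flat)
```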


**Grouped Frequency Counts** using value_counts and nunique 

In [205]:
df_test = pd.DataFrame( [ 1,2,1,2 , 3] )
df_test

Unnamed: 0,0
0,1
1,2
2,1
3,2
4,3


In [206]:
df_test[0].value_counts()

2    2
1    2
3    1
Name: 0, dtype: int64

In [207]:
df_test[0].nunique()

3
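
The two methods are related: value_counts returns one entry per distinct value, so its length should equal nunique. A small sketch using the same numbers as above (the variable names are hypothetical):

```python
import pandas as pd

values = pd.Series([1, 2, 1, 2, 3])

counts = values.value_counts()   # how many times each distinct value occurs
n_unique = values.nunique()      # how many distinct values there are

print(len(counts) == n_unique)
```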

In [208]:
# count the number of unique countries within each continent
print(df.groupby('continent')['country'].nunique())

continent
Africa      52
Americas    25
Asia        33
Europe      30
Oceania      2
Name: country, dtype: int64


In [209]:
print(df.groupby('continent')['country'].value_counts())

continent  country                 
Africa     Algeria                     12
           Angola                      12
           Benin                       12
           Botswana                    12
           Burkina Faso                12
           Burundi                     12
           Cameroon                    12
           Central African Republic    12
           Chad                        12
           Comoros                     12
           Congo, Dem. Rep.            12
           Congo, Rep.                 12
           Cote d'Ivoire               12
           Djibouti                    12
           Egypt                       12
           Equatorial Guinea           12
           Eritrea                     12
           Ethiopia                    12
           Gabon                       12
           Gambia                      12
           Ghana                       12
           Guinea                      12
           Guinea-Bissau               1