[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mosleh-exeter/BEM1025/blob/main/Lecture/02-Lecture02-Getting-to-know-Pandas.ipynb)

# Session 02 - Intro to Pandas

Content:
- Load pandas library
- Load the Gapminder dataset (how to read a CSV file as a dataframe)
- Look at the structure of data, subset on columns and rows


# Pandas?


<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session02-pandas_start.png">

# Pandas as a Python library
### Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session02-fig1.png">

The pandas project provides detailed documentation: https://pandas.pydata.org/docs/

# Loading pandas library

In [45]:
# we import the library pandas and give it the "pd" nickname
import pandas as pd

# Loading gapminder dataset

We use a dataset that contains population, GDP per capita, and life expectancy for countries over multiple years. For similar datasets, see https://www.gapminder.org/data/

To read data files, we use the pandas read_csv function. Read more: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

In [46]:
# we use the pandas.read_csv() function to read the file "gapminder.tsv" stored in a remote location

# with the argument sep='\t' we indicate that the columns are separated by tabs rather than commas

# the default value is sep=',' so to read comma-separated files, we can skip this parameter

df = pd.read_csv('https://raw.githubusercontent.com/mosleh-exeter/BEM1025/main/gapminder.tsv', sep='\t')
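
To see what the sep argument changes, here is a minimal sketch that reads the same small table twice, once tab-separated and once comma-separated. The two text snippets are made-up stand-ins for real files (they are not the Gapminder file itself), and io.StringIO is used only so the example runs without a download.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for a file on disk or at a URL.
tsv_text = "country\tyear\tpop\nAfghanistan\t1952\t8425333\nBangladesh\t1967\t62821884\n"
csv_text = "country,year,pop\nAfghanistan,1952,8425333\nBangladesh,1967,62821884\n"

# Tab-separated text needs sep='\t'.
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Comma-separated text works with the default sep=','.
csv_df = pd.read_csv(io.StringIO(csv_text))

# Check whether both calls produced the same table.
print(tsv_df.equals(csv_df))
```

If sep='\t' were left out when reading the tab-separated text, pandas would look for commas, find none, and put each whole line into a single column.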



### df is a DataFrame.
### DataFrames are core entities in data analytics

<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session02-fig2.png">

In [47]:
# you can use the type() function to check the type of a given variable/object
type(df)

pandas.core.frame.DataFrame

# Observing data

We use the pandas head function to return the first n rows of an object, based on position. It is useful for quickly checking that your object contains the right kind of data. Read more about the head function: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html

In [48]:
# we show the first 5 rows
# here country, continent, year, lifeExp, pop, gdpPercap are columns (also called variables or features)
# and each country/year combination is a row (also called a data point or observation)
df.head(5)

|   | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.85303 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.10071 |
| 3 | Afghanistan | Asia | 1967 | 34.02 | 11537966 | 836.197138 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
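
As a small sketch of how the n argument behaves, the example below uses a made-up seven-row DataFrame (the name toy and its values are purely illustrative). It also shows tail, the counterpart of head that returns rows from the end.

```python
import pandas as pd

# A made-up DataFrame with 7 rows, used only for illustration.
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60, 70]})

first_two = toy.head(2)    # the first 2 rows
default_head = toy.head()  # with no argument, head returns the first 5 rows
last_two = toy.tail(2)     # tail works the same way, from the end

print(len(first_two), len(default_head), len(last_two))
```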


DataFrame.shape returns a tuple representing the dimensionality of the DataFrame: (number of rows, number of columns).

In [50]:
# we show the size of our dataset as (number of rows, number of columns)
df.shape

(1704, 6)
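
Because shape is an ordinary tuple, it can be unpacked into two variables. The sketch below uses a made-up three-row, two-column DataFrame; the names toy, n_rows, and n_cols are illustrative only.

```python
import pandas as pd

# A made-up DataFrame with 3 rows and 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a plain tuple: rows first, columns second.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")

# len() on a DataFrame gives the number of rows.
print(len(toy) == n_rows)
```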

DataFrame.info prints a concise summary of a DataFrame.

This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

In [51]:
# we get some more detailed info on our dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
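
The Non-Null Count column matters most when data is missing. In the Gapminder output above every column has 1704 non-null values, so the sketch below uses a made-up DataFrame with one missing entry to show how the count drops. The name toy and its values are assumptions for illustration.

```python
import pandas as pd

# A made-up DataFrame in which one country name is missing.
toy = pd.DataFrame({
    "country": ["Afghanistan", "Bangladesh", None],
    "year": [1952, 1967, 1967],
})

# info() prints the summary, including the non-null count per column.
toy.info()

# count() returns the same non-null counts as values you can work with.
non_null = toy.count()
print(non_null)
```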


## Selecting columns

DataFrame.columns shows the column labels of the DataFrame.

In [52]:
df.columns

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session02-fig3.png">

We can subset on a column by its name

In [53]:
df['country'].head()

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

To select multiple columns, use a list of column names within the selection brackets 

In [54]:
# we can extract several columns at the same time

df[['country','lifeExp']].head()

|   | country | lifeExp |
|---|---|---|
| 0 | Afghanistan | 28.801 |
| 1 | Afghanistan | 30.332 |
| 2 | Afghanistan | 31.997 |
| 3 | Afghanistan | 34.02 |
| 4 | Afghanistan | 36.088 |
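
The two outputs above look different because single brackets return a Series while a list of names returns a DataFrame. Here is a minimal sketch of that difference on a made-up two-row DataFrame (the name toy is illustrative; the values are taken from rows shown in this notebook).

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Afghanistan", "Bangladesh"],
    "year": [1952, 1967],
    "lifeExp": [28.801, 43.453],
})

one_col = toy["country"]               # single brackets: a Series
one_col_frame = toy[["country"]]       # a list with one name: a one-column DataFrame
two_cols = toy[["country", "lifeExp"]]  # a list with two names: a two-column DataFrame

print(type(one_col).__name__)
print(type(one_col_frame).__name__)
print(two_cols.shape)
```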


## Selecting rows

<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session02-fig4.png">

Let's select the first row. Python starts counting from zero, so the first row is at position 0.

In [55]:
df.iloc[0]

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap     779.445314
Name: 0, dtype: object


Let's extract the 100th row. Python starts counting from zero, so the 100th row is at position 99.

In [56]:
df.iloc[99]

country      Bangladesh
continent          Asia
year               1967
lifeExp          43.453
pop            62821884
gdpPercap    721.186086
Name: 99, dtype: object

To select multiple rows, pass a list of row positions to iloc.

In [57]:
df.iloc[[0,99,999]]

|   | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
| 99 | Bangladesh | Asia | 1967 | 43.453 | 62821884 | 721.186086 |
| 999 | Mongolia | Asia | 1967 | 51.253 | 1149500 | 1226.04113 |
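
Besides a single position or a list of positions, iloc also accepts a slice. The sketch below compares the three forms on a made-up five-row DataFrame (the name toy is illustrative). Note that a slice, like range, stops before its end position.

```python
import pandas as pd

# A made-up DataFrame with 5 rows at positions 0 to 4.
toy = pd.DataFrame({"year": [1952, 1957, 1962, 1967, 1972]})

first_row = toy.iloc[0]      # one position: returns that row as a Series
picked = toy.iloc[[0, 2, 4]]  # a list of positions: returns a DataFrame
sliced = toy.iloc[1:4]       # a slice: positions 1, 2 and 3 (4 is excluded)

print(first_row["year"])
print(list(picked["year"]))
print(list(sliced["year"]))
```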


See this for more details on subsetting in pandas: https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html

## Practice 1: select columns country, continent, and year for rows 1, 100, and 1000

In [58]:
df[['country','continent','year']].iloc[[0,99,999]]

|   | country | continent | year |
|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 |
| 99 | Bangladesh | Asia | 1967 |
| 999 | Mongolia | Asia | 1967 |


In [59]:
df.iloc[[0,99,999]][['country','continent','year']]

|   | country | continent | year |
|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 |
| 99 | Bangladesh | Asia | 1967 |
| 999 | Mongolia | Asia | 1967 |
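
The two solutions above chain a column selection and a row selection in either order. As an alternative sketch (not part of the original exercise), rows and columns can also be selected in one call: loc takes row labels and column names, while iloc takes row and column positions. The DataFrame toy is a made-up stand-in, and the example relies on the default index, where row labels happen to equal row positions.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Afghanistan", "Bangladesh", "Mongolia"],
    "continent": ["Asia", "Asia", "Asia"],
    "year": [1952, 1967, 1967],
    "pop": [8425333, 62821884, 1149500],
})

rows = [0, 2]

# Chaining, as in the practice solutions above.
by_chaining = toy[["country", "year"]].iloc[rows]

# One call with loc: row labels first, then column names.
by_loc = toy.loc[rows, ["country", "year"]]

# One call with iloc: row positions first, then column positions.
by_iloc = toy.iloc[rows, [0, 2]]

print(by_chaining.equals(by_loc), by_loc.equals(by_iloc))
```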


## Practice 2: select columns year and pop for rows 5-20 and 100-999 (all inclusive)


In [60]:
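# we can create two separate lists, combine them, and use the result as input to iloc
# range(start, stop) excludes stop, so range(4, 20) gives positions 4 to 19, i.e. rows 5 to 20
a = list(range(4, 20))
# range(99, 999) gives positions 99 to 998, i.e. rows 100 to 999
b = list(range(99, 999))
# here the plus sign (+) concatenates the two lists
c = a + b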

In [61]:
df.iloc[c][['year','pop']]

|   | year | pop |
|---|---|---|
| 4 | 1972 | 13079460 |
| 5 | 1977 | 14880372 |
| 6 | 1982 | 12881816 |
| 7 | 1987 | 13867957 |
| 8 | 1992 | 16317921 |
| ... | ... | ... |
| 994 | 2002 | 102479927 |
| 995 | 2007 | 108700891 |
| 996 | 1952 | 800663 |
| 997 | 1957 | 882134 |
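
Off-by-one mistakes are easy to make when converting "rows 5-20" into zero-based positions. A quick way to check the ranges, sketched below in plain Python, is to convert the positions back to one-based row numbers and look at the boundaries. The name row_numbers is introduced here for illustration.

```python
# The same positions as in the practice solution.
a = list(range(4, 20))
b = list(range(99, 999))
c = a + b

# Convert zero-based positions back to one-based row numbers.
row_numbers = [position + 1 for position in c]

# First and last row of each block.
print(row_numbers[0], row_numbers[len(a) - 1])   # 5 20
print(row_numbers[len(a)], row_numbers[-1])      # 100 999

# Total number of selected rows: 16 from the first block, 900 from the second.
print(len(c))                                    # 916
```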


## More Resources


Check out this video for further details on Pandas: https://www.youtube.com/watch?v=vmEHCJofslg

Check out Pandas official guide: https://pandas.pydata.org/docs/user_guide/10min.html


<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session02-pandas_end.png">

