# Teacher session 08 - Introduction to Pandas

Content:
- Load pandas library
- Load the Gapminder dataset (how to read a CSV file as a dataframe)
- Look at the structure of data, subset on columns and rows


# Pandas?


<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session02-pandas_start.png">

# Pandas as a python library
### Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session02-fig1.png">

The pandas documentation provides detailed reference material: https://pandas.pydata.org/docs/

# Loading pandas library

In [None]:
# we import the library pandas and give it the "pd" nickname
import pandas as pd

# Loading gapminder dataset

We use a dataset that contains population, GDP, and life expectancy over multiple years. For similar datasets, see https://www.gapminder.org/data/

To read data files, we use the pandas read_csv function. Read more: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

In [None]:
# we use pandas.read_csv() function to access the file "gapminder.tsv" stored in a remote location 

# with the argument sep='\t' we indicate that the columns are separated by tabs rather than commas.

# the default value is sep=',' so to read comma-separated files, we can skip this parameter

df = pd.read_csv('https://raw.githubusercontent.com/msh209/INT3625/main/data/gapminder.tsv', sep='\t')
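
To see why sep='\t' matters, here is a minimal sketch that reads the same tab-separated text twice, once with sep='\t' and once with the default separator. The sample text and the names tsv_text, with_tabs and with_default are made up for illustration and are not part of the Gapminder file.

```python
import io
import pandas as pd

# Hypothetical sample in a tab-separated layout (values are made up).
tsv_text = "country\tyear\tpop\nA\t2000\t10\nB\t2001\t20\n"

# With sep='\t', each tab-separated field becomes its own column.
with_tabs = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# With the default sep=',', no commas are found, so each line lands in one column.
with_default = pd.read_csv(io.StringIO(tsv_text))

print(with_tabs.shape)     # 2 rows, 3 columns
print(with_default.shape)  # 2 rows, 1 column
```

If a freshly loaded DataFrame has only one column, a wrong separator is a likely cause.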


### df is a DataFrame.
### DataFrames are core entities in data analytics

<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session02-fig2.png">

In [None]:
# you can use the type() function to check the type of a given variable/object
type(df)

# Observing data

We use the pandas head function to return the first n rows of an object, based on position. It is useful for quickly checking whether your object has the right type of data in it. Read more about the head function: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html

In [None]:
# we show the first 5 rows
# here country, continent, year, lifeExp, pop, gdpPercap are columns (also called variables or features)
# and each country/year is a row (datapoint, or observation)
df.head(5)

DataFrame.shape returns a tuple representing the dimensionality of the DataFrame, in the order (number of rows, number of columns).

In [None]:
# we show the size of our dataset as (number of rows, number of columns)
df.shape
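
Because shape is a plain tuple, it can be unpacked into two variables. The sketch below uses a small made-up DataFrame called toy (not the Gapminder data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data: 3 rows and 2 columns.
toy = pd.DataFrame({'country': ['A', 'B', 'C'], 'year': [2000, 2001, 2002]})

# shape is (rows, columns), so tuple unpacking gives both counts at once.
n_rows, n_cols = toy.shape

print(n_rows, n_cols)  # 3 2
```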

DataFrame.info prints a concise summary of a DataFrame.

This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

In [None]:
# we get some more detailed info on our dataset
df.info()

## Selecting columns

DataFrame.columns shows the column labels of the DataFrame.

In [None]:
df.columns

<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session02-fig3.png">

We can subset on a column by its name

In [None]:
df['country'].head()

To select multiple columns, use a list of column names within the selection brackets 

In [None]:
# we can extract several columns at the same time

df[['country','lifeExp']].head()
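
The two bracket styles return different kinds of objects: a single name gives a Series, while a list of names gives a DataFrame, even when the list holds only one name. A minimal sketch on a made-up DataFrame called toy:

```python
import pandas as pd

# Hypothetical toy data with two columns.
toy = pd.DataFrame({'country': ['A', 'B'], 'lifeExp': [70.0, 80.0]})

one_col = toy['country']      # single name: a Series
as_frame = toy[['country']]   # list with one name: a DataFrame with one column

print(type(one_col).__name__)   # Series
print(type(as_frame).__name__)  # DataFrame
```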

## Selecting rows

<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session02-fig4.png">

Let's select the first row. Python starts counting from zero, so the first row is at position 0.

In [None]:
df.iloc[0]


Let's extract the 100th row. Because Python starts counting from zero, it is at position 99.

In [None]:
df.iloc[99]

To select multiple rows, pass a list of row positions to iloc

In [None]:
df.iloc[[0,99,999]]
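
Besides a list of positions, iloc also accepts a slice for consecutive rows. As with Python lists, the end of the slice is excluded. The sketch below uses a small made-up DataFrame called toy.

```python
import pandas as pd

# Hypothetical toy data with five rows.
toy = pd.DataFrame({'year': [1952, 1957, 1962, 1967, 1972]})

# A slice selects consecutive positions; 0:3 means positions 0, 1 and 2.
first_three = toy.iloc[0:3]

# A list selects exactly the positions named, here the first and the last row.
picked = toy.iloc[[0, 4]]

print(first_three)
print(picked)
```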

See this for more details on subsetting in pandas: https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html

## Practice 1: select columns country, continent, and year for rows 1, 100, and 1000

In [None]:
df[['country','continent','year']].iloc[[0,99,999]]

In [None]:
df.iloc[[0,99,999]][['country','continent','year']]
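
Both orders give the same result. As a possible alternative, rows and columns can be selected in a single step with loc, which takes row labels and column names. The sketch below runs on a made-up DataFrame called toy and converts positions to labels with toy.index, so it does not depend on how the index is labelled.

```python
import pandas as pd

# Hypothetical toy data standing in for the Gapminder DataFrame.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'continent': ['X', 'Y', 'Z'],
    'year': [2000, 2001, 2002],
    'pop': [1, 2, 3],
})

rows = [0, 2]                            # row positions
cols = ['country', 'continent', 'year']  # column names

# Two steps: first columns by name, then rows by position.
chained = toy[cols].iloc[rows]

# One step: loc takes row labels, so positions are turned into labels first.
single = toy.loc[toy.index[rows], cols]

print(chained.equals(single))  # True
```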

## Practice 2: select columns year and pop for rows 5-20 and 100-999 (all inclusive)


In [None]:
# we can create two separate lists, combine them, and use the result as input to iloc
a=list(range(4,20+1))
b=list(range(99,999+1))
# here the + operator concatenates the two lists
c=a+b

print(c)

In [None]:
df.iloc[c][['year','pop']]
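
As a quick check on the off-by-one arithmetic, the sketch below counts the selected positions and compares the list-based selection with an alternative built from two iloc slices joined by pd.concat. The DataFrame toy is made up (1000 rows of placeholder numbers) so that the example runs without the Gapminder file.

```python
import pandas as pd

# Hypothetical toy data with 1000 rows of placeholder values.
toy = pd.DataFrame({'year': range(1000), 'pop': range(1000)})

# Rows 5-20 and 100-999 (counting from one) are positions 4-19 and 99-998.
# range() excludes its end value, hence the +1.
a = list(range(4, 20 + 1))
b = list(range(99, 999 + 1))
c = a + b
print(len(a), len(b), len(c))  # 17 901 918

via_list = toy.iloc[c][['year', 'pop']]

# Alternative: iloc slices also exclude their end, so 4:21 covers positions 4-20.
via_slices = pd.concat([toy.iloc[4:21], toy.iloc[99:1000]])[['year', 'pop']]

print(via_list.equals(via_slices))  # True
```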

## More Resources


Check out this video for further details on Pandas: https://www.youtube.com/watch?v=vmEHCJofslg

Check out Pandas official guide: https://pandas.pydata.org/docs/user_guide/10min.html


<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/session02-pandas_end.png">

