# Introduction to Pandas

Data Science  
TECNUN - Escuela de Ingeniería  
Universidad de Navarra

- Idoia Ochoa: iochoal@unav.es

To start, we import the `pandas` library, together with `numpy`.

In [None]:
import pandas as pd
import numpy as np

## Introduction to pandas

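The central data structure in `pandas` is the dataframe (`DataFrame`), a table with labelled rows and columns. Let's start by creating one from a dictionary, where each key becomes a column name and each list holds that column's values.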

In [None]:
data = {'Country' : ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'ND', 'Brasilia'],
        'Population' : [11000000, 1300000000, 200000000]}

In [None]:
# Generate a dataframe from this Dictionary
df = pd.DataFrame(data)


In [None]:
print(df)

   Country   Capital  Population
0  Belgium  Brussels    11000000
1    India        ND  1300000000
2   Brazil  Brasilia   200000000


With the `head` method, we can view the first few rows of the dataframe. In this case, we will see the entire dataframe because it only has 3 rows.



In [None]:
df.head() # returns the first 5 rows of the df (the default)

Unnamed: 0,Country,Capital,Population
0,Belgium,Brussels,11000000
1,India,ND,1300000000
2,Brazil,Brasilia,200000000


In [None]:
df.tail() # print the last 5 rows

Unnamed: 0,Country,Capital,Population
0,Belgium,Brussels,11000000
1,India,ND,1300000000
2,Brazil,Brasilia,200000000
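Both methods accept the number of rows to return. The following is a minimal sketch that rebuilds the same three-row dataframe so it can run on its own; the variable names `first_two` and `last_one` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'ND', 'Brasilia'],
    'Population': [11000000, 1300000000, 200000000],
})

first_two = df.head(2)  # rows with index 0 and 1
last_one = df.tail(1)   # row with index 2

print(first_two)
print(last_one)
print(df.shape)         # (number of rows, number of columns)
```

Checking `df.shape` first is a quick way to know whether `head()` is showing the whole dataframe or only part of it.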


You can also create the dataframe from a numpy array as follows.

In [None]:
data_np = np.array([['Belgium','Brussels',11000000],
                 ['India','ND',1300000000],
                 ['Brazil','Brasilia',200000000]])


In [None]:
# pass column names in the columns parameter
df2 = pd.DataFrame(data_np, columns=['Country','Capital','Population'])

In [None]:
df2.head()

Unnamed: 0,Country,Capital,Population
0,Belgium,Brussels,11000000
1,India,ND,1300000000
2,Brazil,Brasilia,200000000
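One caveat with this approach: a numpy array holds a single data type, so mixing text and numbers turns every value into a string. The sketch below shows one possible way to check this and to recover a numeric column with `pd.to_numeric`; the helper variable `was_string` is only illustrative.

```python
import numpy as np
import pandas as pd

data_np = np.array([['Belgium', 'Brussels', 11000000],
                    ['India', 'ND', 1300000000],
                    ['Brazil', 'Brasilia', 200000000]])
df2 = pd.DataFrame(data_np, columns=['Country', 'Capital', 'Population'])

# The integers were converted to strings when the array was built.
was_string = isinstance(df2['Population'].iloc[0], str)
print(was_string)

# Convert the column back to numbers before doing arithmetic on it.
df2['Population'] = pd.to_numeric(df2['Population'])
total_population = df2['Population'].sum()
print(total_population)
```

Building the dataframe from a dictionary avoids the problem, because each column keeps its own type.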


Next, we will look at different ways to access the data. The first is slicing with square brackets, just as we would with a list or a numpy array. Note that slicing this way selects only rows (observations), not columns (variables); to access columns, we need to use their labels.

In [None]:
# Access rows/samples
df[0:2]

Unnamed: 0,Country,Capital,Population
0,Belgium,Brussels,11000000
1,India,ND,1300000000


Each column (variable) is also exposed as an attribute of the dataframe, so a single column can be accessed by name as follows. Indexing with a list of labels in square brackets selects columns as well.

In [None]:
# Access specific columns
df.Country


Unnamed: 0,Country
0,Belgium
1,India
2,Brazil


In [None]:
df[['Country']]

Unnamed: 0,Country
0,Belgium
1,India
2,Brazil
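The two forms above look alike when displayed but return different types. As a small sketch (variable names are illustrative), attribute access and a single label give a `Series`, while a list of labels gives a `DataFrame`:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'ND', 'Brasilia'],
    'Population': [11000000, 1300000000, 200000000],
})

as_attribute = df.Country     # Series
as_label = df['Country']      # Series with the same data
as_list = df[['Country']]     # DataFrame with a single column
two_columns = df[['Country', 'Population']]  # DataFrame with two columns

print(type(as_attribute))
print(type(as_list))
print(two_columns)
```

The bracket form is the safer habit: attribute access only works when the column name is a valid Python identifier and does not clash with an existing dataframe method.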


We can also use the `iloc` indexer to access data by integer position. As with Python slices, the end position is excluded.

In [None]:
df.iloc[0:2,1:3]

Unnamed: 0,Capital,Population
0,Brussels,11000000
1,ND,1300000000


On the other hand, to access data by label, we need to use the `loc` indexer. Unlike `iloc`, a `loc` slice includes its end label, so `0:2` returns the rows labelled 0, 1 and 2.



In [None]:
df.loc[0:2,['Country','Population']]

Unnamed: 0,Country,Population
0,Belgium,11000000
1,India,1300000000
2,Brazil,200000000
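The difference between position and label is easier to see once the index is no longer 0, 1, 2. The sketch below is illustrative only: it uses `set_index` to make `Country` the row label, and the name `df_named` is not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'ND', 'Brasilia'],
    'Population': [11000000, 1300000000, 200000000],
})

by_position = df.iloc[0:2]  # positions 0 and 1 (end excluded)
by_label = df.loc[0:2]      # labels 0, 1 and 2 (end included)
print(len(by_position), len(by_label))

# With country names as the index, loc takes names and iloc still takes positions.
df_named = df.set_index('Country')
print(df_named.loc['India', 'Capital'])  # by row label and column label
print(df_named.iloc[0, 1])               # first row, second column
```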


We can also access information conditionally.

In [None]:
df

Unnamed: 0,Country,Capital,Population
0,Belgium,Brussels,11000000
1,India,ND,1300000000
2,Brazil,Brasilia,200000000


In [None]:
a = df.loc[0:1,['Population']]

In [None]:
print(type(a))

<class 'pandas.core.frame.DataFrame'>


In [None]:
a

Unnamed: 0,Population
0,11000000
1,1300000000


In [None]:
df['Population']>50000000

Unnamed: 0,Population
0,False
1,True
2,True


In [None]:
df[df['Population']>50000000]

Unnamed: 0,Country,Capital,Population
1,India,ND,1300000000
2,Brazil,Brasilia,200000000
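Conditions can be combined, and a boolean mask can also be passed to `loc` to pick columns at the same time. This is a minimal sketch with illustrative variable names; note that each condition needs its own parentheses and that `&` and `|` are used instead of `and` and `or`.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'ND', 'Brasilia'],
    'Population': [11000000, 1300000000, 200000000],
})

# Two conditions combined with a logical AND.
mask = (df['Population'] > 50000000) & (df['Country'] != 'India')
large_not_india = df[mask]

# Boolean mask for the rows, label for the column.
capitals = df.loc[df['Population'] > 50000000, 'Capital']

# Membership test against a list of values.
selected = df[df['Country'].isin(['Belgium', 'Brazil'])]

print(large_not_india)
print(capitals)
print(selected)
```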


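With the `drop` method, we can remove rows or columns and keep the rest of the dataframe. By default it returns a new dataframe and leaves the original unchanged.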

In [None]:
df_1 = df.drop([0,2], axis=0) # Drops lines/rows 0 and 2
# axis 0 is the rows
# axis 1 is the columns
print(df_1)

  Country Capital  Population
1   India      ND  1300000000


In [None]:
df.drop(['Capital'], axis=1) # Drops columns

Unnamed: 0,Country,Population
0,Belgium,11000000
1,India,1300000000
2,Brazil,200000000


In [None]:
df

Unnamed: 0,Country,Capital,Population
0,Belgium,Brussels,11000000
1,India,ND,1300000000
2,Brazil,Brasilia,200000000


In [None]:
df_no_capital_col = df.drop(['Capital'], axis=1)

In [None]:
df_no_capital_col

Unnamed: 0,Country,Population
0,Belgium,11000000
1,India,1300000000
2,Brazil,200000000


In [None]:
# df.drop(['Capital'], axis=1, inplace=True) # Drops the column and MODIFIES df in place

In [None]:
# df
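As an alternative to the `axis` argument, `drop` also accepts the keywords `index` and `columns`, which some readers find easier to remember. A short sketch with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'ND', 'Brasilia'],
    'Population': [11000000, 1300000000, 200000000],
})

no_capital = df.drop(columns=['Capital'])  # same as df.drop(['Capital'], axis=1)
only_india = df.drop(index=[0, 2])         # same as df.drop([0, 2], axis=0)

print(no_capital)
print(only_india)
print(df.columns.tolist())  # the original dataframe still has all its columns
```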

We can sort the dataframe based on the indices using the `sort_index` method. This allows you to reorder the rows (or columns) of the dataframe according to the index values.

In [None]:
sorted_df = df.sort_index()

In [None]:
sorted_df


Unnamed: 0,Country,Capital,Population
0,Belgium,Brussels,11000000
1,India,ND,1300000000
2,Brazil,Brasilia,200000000


In [None]:
sorted_df_population = df.sort_values(by='Population')

In [None]:
sorted_df_population

Unnamed: 0,Country,Capital,Population
0,Belgium,Brussels,11000000
2,Brazil,Brasilia,200000000
1,India,ND,1300000000


The `sort_values` method used above sorts a dataframe by the values of one or more columns. This is particularly useful when you want to rank or organize data according to specific criteria. Note that each row keeps its original index label after sorting.
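Both sorting methods take further options. The sketch below is a small illustration, with variable names chosen here, of descending order, sorting by several columns, and restoring or renumbering the index afterwards.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'ND', 'Brasilia'],
    'Population': [11000000, 1300000000, 200000000],
})

# Largest population first.
largest_first = df.sort_values(by='Population', ascending=False)

# Sort by Country, breaking ties by Population.
by_two_columns = df.sort_values(by=['Country', 'Population'])

# Rows keep their old labels after sorting; reset_index renumbers them.
renumbered = largest_first.reset_index(drop=True)

# sort_index puts the rows back in their original label order.
restored = largest_first.sort_index()

print(largest_first)
print(renumbered)
```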

We can also add a new column.

In [None]:
# df['Country_Capital']

In [None]:
# Let's add a new column called 'Country_Capital'
df['Country_Capital'] = df['Country'] + '_' + df['Capital']

In [None]:
df

Unnamed: 0,Country,Capital,Population,Country_Capital
0,Belgium,Brussels,11000000,Belgium_Brussels
1,India,ND,1300000000,India_ND
2,Brazil,Brasilia,200000000,Brazil_Brasilia


We can also add one row to the dataframe.

In [None]:
df.loc[3] = ['Germany','Berlin',80000000, 'Germany_Berlin']

In [None]:
df

Unnamed: 0,Country,Capital,Population,Country_Capital
0,Belgium,Brussels,11000000,Belgium_Brussels
1,India,ND,1300000000,India_ND
2,Brazil,Brasilia,200000000,Brazil_Brasilia
3,Germany,Berlin,80000000,Germany_Berlin
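Assigning to `df.loc[3]` works well for a single row. When several rows have to be added, one possible alternative is `pd.concat`, sketched here on the three original columns; `new_rows` and `df_extended` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'ND', 'Brasilia'],
    'Population': [11000000, 1300000000, 200000000],
})

new_rows = pd.DataFrame({
    'Country': ['Germany'],
    'Capital': ['Berlin'],
    'Population': [80000000],
})

# ignore_index=True renumbers the rows 0..n-1 in the result.
df_extended = pd.concat([df, new_rows], ignore_index=True)
print(df_extended)
```

`pd.concat` returns a new dataframe, so `df` itself still has three rows afterwards.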


We can delete duplicate rows with the `drop_duplicates` method. To try it, let's first add a second copy of the India row.

In [None]:
df.loc[4] = ['India','ND',1300000000, 'India_ND']

In [None]:
df

Unnamed: 0,Country,Capital,Population,Country_Capital
0,Belgium,Brussels,11000000,Belgium_Brussels
1,India,ND,1300000000,India_ND
2,Brazil,Brasilia,200000000,Brazil_Brasilia
3,Germany,Berlin,80000000,Germany_Berlin
4,India,ND,1300000000,India_ND


In [None]:
df.drop_duplicates()

Unnamed: 0,Country,Capital,Population,Country_Capital
0,Belgium,Brussels,11000000,Belgium_Brussels
1,India,ND,1300000000,India_ND
2,Brazil,Brasilia,200000000,Brazil_Brasilia
3,Germany,Berlin,80000000,Germany_Berlin


In [None]:
df

Unnamed: 0,Country,Capital,Population,Country_Capital
0,Belgium,Brussels,11000000,Belgium_Brussels
1,India,ND,1300000000,India_ND
2,Brazil,Brasilia,200000000,Brazil_Brasilia
3,Germany,Berlin,80000000,Germany_Berlin
4,India,ND,1300000000,India_ND


Why was `df` not modified? By default, `drop_duplicates` returns a new dataframe and leaves the original untouched. To modify `df` itself, we need to pass `inplace=True`.

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df

Unnamed: 0,Country,Capital,Population,Country_Capital
0,Belgium,Brussels,11000000,Belgium_Brussels
1,India,ND,1300000000,India_ND
2,Brazil,Brasilia,200000000,Brazil_Brasilia
3,Germany,Berlin,80000000,Germany_Berlin
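Before deleting anything, it can help to see which rows are considered duplicates. The sketch below uses `duplicated` for that, and also shows the `subset` and `keep` parameters of `drop_duplicates`; the small dataframe and variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Belgium', 'India', 'Brazil', 'India'],
    'Capital': ['Brussels', 'ND', 'Brasilia', 'ND'],
    'Population': [11000000, 1300000000, 200000000, 1300000000],
})

# True for every row that repeats an earlier row.
dup_mask = df.duplicated()
print(dup_mask)

# Assigning the result is an alternative to inplace=True.
deduped = df.drop_duplicates().reset_index(drop=True)

# Compare only the Country column and keep the last occurrence.
last_per_country = df.drop_duplicates(subset=['Country'], keep='last')

print(deduped)
print(last_per_country)
```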


The `describe()` method provides a summary of the numerical features: count, mean, standard deviation, minimum, quartiles, and maximum.

In [None]:
df.describe()

Unnamed: 0,Population
count,4.0
mean,397750000.0
std,606547800.0
min,11000000.0
25%,62750000.0
50%,140000000.0
75%,475000000.0
max,1300000000.0
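Each entry of this summary can also be computed on its own. As a quick check, the sketch below recomputes a few of the values above from the four populations; the variable names are illustrative.

```python
import pandas as pd

population = pd.Series([11000000, 1300000000, 200000000, 80000000],
                       name='Population')

mean_pop = population.mean()            # sum of the values divided by 4
median_pop = population.median()        # the 50% row of describe()
q25_pop = population.quantile(0.25)     # the 25% row of describe()

print(mean_pop)
print(median_pop)
print(q25_pop)
```

With an even number of values, the median is the average of the two middle ones, here 80000000 and 200000000.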


We can also create a dataframe from a `.csv` file as follows.

In [None]:
df_csv = pd.read_csv('https://raw.githubusercontent.com/iochoa/ml-datasets/master/diabetes.csv')

In [None]:
df_csv

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [None]:
df_csv.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
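`read_csv` accepts any file-like object, not only a URL or a path, which makes it easy to try its options without a network connection. The sketch below is an illustration only: it reads the first three rows shown above from an in-memory string, and uses the `usecols` and `nrows` parameters to load part of the data.

```python
import io

import pandas as pd

# First three rows of the dataset, copied from the output above.
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Load only some columns and only the first two rows.
subset = pd.read_csv(io.StringIO(csv_text),
                     usecols=['Glucose', 'BMI', 'Outcome'],
                     nrows=2)

print(sample.shape)
print(subset)
```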


In [None]:
df_csv.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


We can also apply a function to a dataframe to modify some values.

For example, let's create a function that replaces every value greater than 5 with 6 and leaves all other values unchanged.

In [None]:
def transform_value(x):
    if x > 5:
        return 6
    else:
        return x

In [None]:
# Let's see if this function works
transform_value(7)

6

To apply the function to a dataframe column, call the `apply` method on that column. It calls the function once for every element of the column and collects the results in a new `Series`. Note that `apply` is not vectorized: it runs a Python-level loop, so a built-in vectorized operation, when one exists, is usually faster on large data.

Let's apply the function to the `Pregnancies` column and store the result in a new column called `Pregnancies_cap`.

In [None]:
df_csv['Pregnancies_cap'] = df_csv['Pregnancies'].apply(transform_value)

In [None]:
df_csv.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,Pregnancies_cap
0,6,148,72,35,0,33.6,0.627,50,1,6
1,1,85,66,29,0,26.6,0.351,31,0,1
2,8,183,64,0,0,23.3,0.672,32,1,6
3,1,89,66,23,94,28.1,0.167,21,0,1
4,0,137,40,35,168,43.1,2.288,33,1,0


Now let's do this the Pythonic way, using a `lambda` function inside `apply`. Keep in mind that a conditional expression in Python has the following structure: `<value_if_true> if <condition> else <value_if_false>`.

In [None]:
df_csv['Pregnancies_cap_2'] = df_csv['Pregnancies'].apply(lambda x: 6 if x > 5 else x)

Let's check that both options give the same result: if they do, the differences between the two columns sum to zero.

In [None]:
sum(df_csv['Pregnancies_cap_2'] - df_csv['Pregnancies_cap'])

0
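A sum of differences can be zero even when individual entries differ, because positive and negative differences cancel out. A stricter check is `Series.equals`, sketched below on the first five `Pregnancies` values shown earlier. The sketch also includes one possible vectorized alternative, `Series.mask`, which is not used in the original notebook.

```python
import pandas as pd

# First five values of the Pregnancies column, copied from the output above.
pregnancies = pd.Series([6, 1, 8, 1, 0], name='Pregnancies')


def transform_value(x):
    """Replace values greater than 5 with 6; leave the rest unchanged."""
    return 6 if x > 5 else x


with_function = pregnancies.apply(transform_value)
with_lambda = pregnancies.apply(lambda x: 6 if x > 5 else x)

# Element-by-element comparison of the two results.
same_result = with_function.equals(with_lambda)
print(same_result)

# Vectorized alternative: replace the values where the condition holds.
with_mask = pregnancies.mask(pregnancies > 5, 6)
print(with_mask.tolist())
```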