
In [1]:
import pandas as pd
import numpy as np

# Load data

In [2]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = pd.read_csv(url, header=None, names=['sepal_length','sepal_width', 'petal_length', 'petal_width', 'class'])

iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


# Data cleaning

## Missing values

### Is there any missing value in the dataframe?

In [14]:
# .values.any() is True as soon as a single cell is NaN
if iris.isnull().values.any():
  print('Yes, there are missing values!')
else:
  print('No missing values found!')

No missing values found!


### Let's set the values of rows 10 to 29 of the column 'petal_length' to NaN

In [21]:
# Assign in a single .loc step. The chained form
# iris.iloc[10:30,].loc[:,'petal_length'] = ... sets values on an intermediate
# object and will not update iris under Copy-on-Write (the default in pandas 3.0).
# .loc slices by label and includes the end label, so 10:29 covers rows 10 to 29.
iris.loc[10:29, 'petal_length'] = np.nan
iris.iloc[10:30]


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
10,5.4,3.7,,0.2,Iris-setosa
11,4.8,3.4,,0.2,Iris-setosa
12,4.8,3.0,,0.1,Iris-setosa
13,4.3,3.0,,0.1,Iris-setosa
14,5.8,4.0,,0.2,Iris-setosa
15,5.7,4.4,,0.4,Iris-setosa
16,5.4,3.9,,0.4,Iris-setosa
17,5.1,3.5,,0.3,Iris-setosa
18,5.7,3.8,,0.3,Iris-setosa
19,5.1,3.8,,0.3,Iris-setosa


### Which column has the maximum number of missing values?

In [34]:
iris.iloc[:,np.argmax(iris.isnull().astype(int).sum())]

Unnamed: 0,petal_length
0,1.4
1,1.4
2,1.3
3,1.5
4,1.4
...,...
145,5.2
146,5.0
147,5.2
148,5.4
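
The cell above returns the values of the column with the most missing entries. If only the column name is needed, one possible approach is `idxmax` on the per-column NaN counts. The sketch below uses a toy frame (`df` and its column names are illustrative, not part of the exercise):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for iris; column names are illustrative only
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

missing_per_column = df.isnull().sum()      # NaN count for each column
worst_column = missing_per_column.idxmax()  # label of the column with the most NaN

print(missing_per_column)
print(worst_column)
```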


### Try to replace the NaN values using two methods:
- replace null values with the column mean (apply it to a copy of the dataframe)
- replace null values with 1.0



In [46]:
iris.loc[:,'petal_length'].fillna(iris.loc[:,'petal_length'].mean())

Unnamed: 0,petal_length
0,1.4
1,1.4
2,1.3
3,1.5
4,1.4
...,...
145,5.2
146,5.0
147,5.2
148,5.4
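
The cell above returns a filled Series and leaves `iris` untouched. To follow the "apply it to a copy" hint for the whole dataframe, a minimal sketch could look like this (the toy frame `df` and the names `filled` and `column_means` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with NaN in two numeric columns and one text column
df = pd.DataFrame({
    "petal_length": [1.0, np.nan, 3.0, np.nan],
    "petal_width": [0.2, 0.4, np.nan, 0.6],
    "class": ["x", "y", "x", "y"],
})

filled = df.copy()                              # work on a copy, keep df as it is
column_means = filled.mean(numeric_only=True)   # one mean per numeric column
filled = filled.fillna(column_means)            # each column is filled with its own mean

print(filled)
```

Passing a Series to `fillna` matches its index against the column labels, so each numeric column gets its own mean and the text column is left alone.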


In [47]:
# Assign the filled column back. Calling fillna(..., inplace=True) on a selected
# column acts on an intermediate object and will not update iris under
# Copy-on-Write (the default in pandas 3.0).
iris['petal_length'] = iris['petal_length'].fillna(1.0)
iris.iloc[10:30]


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
10,5.4,3.7,1.0,0.2,Iris-setosa
11,4.8,3.4,1.0,0.2,Iris-setosa
12,4.8,3.0,1.0,0.1,Iris-setosa
13,4.3,3.0,1.0,0.1,Iris-setosa
14,5.8,4.0,1.0,0.2,Iris-setosa
15,5.7,4.4,1.0,0.4,Iris-setosa
16,5.4,3.9,1.0,0.4,Iris-setosa
17,5.1,3.5,1.0,0.3,Iris-setosa
18,5.7,3.8,1.0,0.3,Iris-setosa
19,5.1,3.8,1.0,0.3,Iris-setosa


### Set the first 3 rows as NaN

In [57]:
# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
iris.iloc[:3] = np.nan
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,,,,,
1,,,,,
2,,,,,
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


### Delete the rows that have all NaN

In [80]:
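# how='all' drops only the rows in which every value is NaN
# (the default, how='any', would also drop rows with a single NaN)
iris.dropna(how='all', inplace=True)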
iris

Unnamed: 0,index,sepal_length,sepal_width,petal_length,petal_width,class
0,3,4.6,3.1,1.5,0.2,Iris-setosa
1,4,5.0,3.6,1.4,0.2,Iris-setosa
2,5,5.4,3.9,1.7,0.4,Iris-setosa
3,6,4.6,3.4,1.4,0.3,Iris-setosa
4,7,5.0,3.4,1.5,0.2,Iris-setosa
...,...,...,...,...,...,...
142,145,6.7,3.0,5.2,2.3,Iris-virginica
143,146,6.3,2.5,5.0,1.9,Iris-virginica
144,147,6.5,3.0,5.2,2.0,Iris-virginica
145,148,6.2,3.4,5.4,2.3,Iris-virginica


### Reset the index so it begins with 0 again

In [63]:
iris.reset_index(inplace=True)
iris

Unnamed: 0,index,sepal_length,sepal_width,petal_length,petal_width,class
0,3,4.6,3.1,1.5,0.2,Iris-setosa
1,4,5.0,3.6,1.4,0.2,Iris-setosa
2,5,5.4,3.9,1.7,0.4,Iris-setosa
3,6,4.6,3.4,1.4,0.3,Iris-setosa
4,7,5.0,3.4,1.5,0.2,Iris-setosa
...,...,...,...,...,...,...
142,145,6.7,3.0,5.2,2.3,Iris-virginica
143,146,6.3,2.5,5.0,1.9,Iris-virginica
144,147,6.5,3.0,5.2,2.0,Iris-virginica
145,148,6.2,3.4,5.4,2.3,Iris-virginica
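
Note that `reset_index` keeps the old labels in a new `index` column, which is why that column appears in the output above. If the old labels are not needed, `drop=True` discards them. A small sketch on a toy frame (`df`, `kept`, and `dropped` are illustrative names):

```python
import pandas as pd

# Toy frame whose index starts at 3, as iris does after dropping the first rows
df = pd.DataFrame({"x": [10, 20, 30]}, index=[3, 4, 5])

kept = df.reset_index()              # old labels become an 'index' column
dropped = df.reset_index(drop=True)  # old labels are discarded

print(list(kept.columns))
print(list(dropped.columns))
```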


## Duplicates

### Does the dataframe contain duplicated rows? If any, visualize all duplicated rows (don't omit first or last occurrences)

In [69]:
# duplicated() marks every repeated row except its first occurrence
if iris.duplicated().any():
  print('Yes, there are duplicated rows!')
else:
  print('No duplicated rows found!')

No duplicated rows found!
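
One thing to keep in mind: the `index` column created by `reset_index` holds a different value in every row, so a check over all columns may never report duplicates. A possible way to handle this, sketched on a toy frame, is to compare only the measurement columns and pass `keep=False` so that every occurrence is shown (`df`, `measure_cols`, and the sample values are assumptions):

```python
import pandas as pd

# Toy frame: rows 0 and 2 have identical measurements but different 'index' values
df = pd.DataFrame({
    "index": [0, 1, 2, 3],
    "sepal_length": [4.9, 5.0, 4.9, 6.1],
    "sepal_width": [3.1, 3.4, 3.1, 2.8],
})

measure_cols = [c for c in df.columns if c != "index"]

all_cols_mask = df.duplicated(keep=False)                     # 'index' makes every row unique
measure_mask = df.duplicated(subset=measure_cols, keep=False) # ignores the 'index' column

duplicates = df[measure_mask]  # all occurrences, first and last included
print(duplicates)
```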


### Which row is the most repeated?

In [70]:
# None: the check above found no duplicated rows
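
For a dataframe that does contain repeats, one way to find the most repeated row is `DataFrame.value_counts`, which counts each distinct row. This is only a sketch on made-up data (`df`, `row_counts`, and the values are illustrative):

```python
import pandas as pd

# Toy frame in which the row (4.9, 3.1) occurs three times
df = pd.DataFrame({
    "sepal_length": [4.9, 5.0, 4.9, 4.9, 5.0],
    "sepal_width": [3.1, 3.4, 3.1, 3.1, 3.4],
})

row_counts = df.value_counts()       # one entry per distinct row
most_repeated = row_counts.idxmax()  # tuple with the values of the most frequent row
times = row_counts.max()             # how often that row occurs

print(most_repeated, times)
```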

### Drop duplicated rows

In [71]:
iris.drop_duplicates(inplace=True)  # no-op here: the check above reported no duplicated rows

## Detect outliers, e.g., values that are higher than the 85th percentile or lower than the 25th percentile.

In [96]:
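# Loop over the numeric columns only: position 0 is the 'index' column
# and the last position is the 'class' column
cols = iris.shape[1] - 1
outliers = pd.DataFrame()

for i in range(1, cols):
  p25th = iris.iloc[:, i].quantile(0.25)
  p85th = iris.iloc[:, i].quantile(0.85)

  # Rows whose value in column i lies below the 25th or above the 85th percentile
  col_outliers = iris[(iris.iloc[:, i] < p25th) | (iris.iloc[:, i] > p85th)]
  outliers = pd.concat([outliers, col_outliers])

print(outliers)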

     index  sepal_length  sepal_width  petal_length  petal_width  \
0        3           4.6          3.1           1.5          0.2   
1        4           5.0          3.6           1.4          0.2   
3        6           4.6          3.4           1.4          0.3   
4        7           5.0          3.4           1.5          0.2   
5        8           4.4          2.9           1.4          0.2   
..     ...           ...          ...           ...          ...   
138    141           6.9          3.1           5.1          2.3   
140    143           6.8          3.2           5.9          2.3   
141    144           6.7          3.3           5.7          2.5   
142    145           6.7          3.0           5.2          2.3   
145    148           6.2          3.4           5.4          2.3   

              class  
0       Iris-setosa  
1       Iris-setosa  
3       Iris-setosa  
4       Iris-setosa  
5       Iris-setosa  
..              ...  
138  Iris-virginica  
140  Ir
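
The loop above concatenates one slice per column, so a row that is extreme in several columns shows up several times. A possible alternative, sketched here on a toy frame, computes all the percentiles at once and keeps each flagged row a single time (`df`, `low`, `high`, and `outside` are illustrative names):

```python
import pandas as pd

# Toy numeric frame; column 'b' is constant, so it has no outliers
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 10.0, 10.0, 10.0, 10.0],
})

low = df.quantile(0.25)   # 25th percentile of each column
high = df.quantile(0.85)  # 85th percentile of each column

# Boolean frame: True where a cell is below its column's 25th
# or above its column's 85th percentile
outside = df.lt(low) | df.gt(high)

# Keep each row once if at least one of its cells is flagged
outlier_rows = df[outside.any(axis=1)]
print(outlier_rows)
```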

# Data transformation

## Replace class values by removing "Iris-" prefix (use a dictionary)

In [98]:
class_mapping = {
    'Iris-setosa': 'setosa',
    'Iris-versicolor': 'versicolor',
    'Iris-virginica': 'virginica'
}
iris['class'] = iris['class'].replace(class_mapping)
iris

Unnamed: 0,index,sepal_length,sepal_width,petal_length,petal_width,class
0,3,4.6,3.1,1.5,0.2,setosa
1,4,5.0,3.6,1.4,0.2,setosa
2,5,5.4,3.9,1.7,0.4,setosa
3,6,4.6,3.4,1.4,0.3,setosa
4,7,5.0,3.4,1.5,0.2,setosa
...,...,...,...,...,...,...
142,145,6.7,3.0,5.2,2.3,virginica
143,146,6.3,2.5,5.0,1.9,virginica
144,147,6.5,3.0,5.2,2.0,virginica
145,148,6.2,3.4,5.4,2.3,virginica


## Delete columns
Delete, for example, the class column.

In [102]:
iris.drop('class', axis=1, inplace=True)

In [103]:
iris

Unnamed: 0,index,sepal_length,sepal_width,petal_length,petal_width
0,3,4.6,3.1,1.5,0.2
1,4,5.0,3.6,1.4,0.2
2,5,5.4,3.9,1.7,0.4
3,6,4.6,3.4,1.4,0.3
4,7,5.0,3.4,1.5,0.2
...,...,...,...,...,...
142,145,6.7,3.0,5.2,2.3
143,146,6.3,2.5,5.0,1.9
144,147,6.5,3.0,5.2,2.0
145,148,6.2,3.4,5.4,2.3


In [108]:
iris.drop('index', axis=1, inplace=True)
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,4.6,3.1,1.5,0.2
1,5.0,3.6,1.4,0.2
2,5.4,3.9,1.7,0.4
3,4.6,3.4,1.4,0.3
4,5.0,3.4,1.5,0.2
...,...,...,...,...
142,6.7,3.0,5.2,2.3
143,6.3,2.5,5.0,1.9
144,6.5,3.0,5.2,2.0
145,6.2,3.4,5.4,2.3


## How to normalize all columns in a dataframe?
- Normalize all columns of df by subtracting the column mean and dividing by the standard deviation.
- Rescale all columns of df such that the minimum value in each column is 0 and the maximum is 1.

In [110]:
# Standardization: subtract each column's mean, divide by its sample standard deviation
(iris - iris.mean()) / iris.std()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,-1.530460,0.113830,-1.217861,-1.345671
1,-1.045595,1.259928,-1.272193,-1.345671
2,-0.560729,1.947587,-1.109196,-1.081568
3,-1.530460,0.801489,-1.272193,-1.213619
4,-1.045595,0.801489,-1.217861,-1.345671
...,...,...,...,...
142,1.015084,-0.115389,0.792441,1.427418
143,0.530219,-1.261488,0.683776,0.899210
144,0.772652,-0.115389,0.792441,1.031262
145,0.409002,0.801489,0.901106,1.427418
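
The cell above covers the first bullet only. For the second bullet, a minimal sketch of min-max rescaling on a toy frame could be the following (`df` and `scaled` are illustrative names, and the frame is assumed to be fully numeric):

```python
import pandas as pd

# Toy numeric frame
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 30.0, 20.0]})

# Subtract the column minimum and divide by the column range.
# A constant column has range 0 and would yield NaN here.
scaled = (df - df.min()) / (df.max() - df.min())

print(scaled)
```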


## Binning and discretization
Discretize the dataframe columns into 4 bins and get the frequency distribution of the new values.
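
No solution is given for this exercise. One possible approach, assuming equal-width bins are acceptable, is `pd.cut` applied column by column, followed by `value_counts`. The toy frame and the names `binned` and `frequencies` are assumptions for illustration:

```python
import pandas as pd

# Toy numeric frame
df = pd.DataFrame({
    "a": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "b": [1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 9.0],
})

# Cut each column into 4 equal-width bins; labels=False returns the bin number (0 to 3)
binned = df.apply(lambda col: pd.cut(col, bins=4, labels=False))

# Count how many values fall in each bin; reindex so that empty bins show a count of 0
frequencies = binned.apply(
    lambda col: col.value_counts().reindex(range(4), fill_value=0)
)

print(frequencies)
```

`pd.qcut` could be used instead if bins with roughly equal counts are preferred over bins of equal width.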

## Binarize categorical data (dummy variables)
Based on the previous result, binarize all dataframe columns.
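
No solution is given for this exercise either. Assuming the "previous result" is a frame of bin numbers, one option is `pd.get_dummies`, which creates one indicator column per bin. The sketch below rebuilds a small binned frame so that it runs on its own (all names and values are illustrative):

```python
import pandas as pd

# Toy numeric frame, discretized into 2 bins per column to keep the output small
df = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 1.0, 5.0, 9.0]})
binned = df.apply(lambda col: pd.cut(col, bins=2, labels=False))

# Bin numbers are integers, so the columns must be listed explicitly:
# by default get_dummies only encodes object, string, and category columns
dummies = pd.get_dummies(binned, columns=list(binned.columns))

print(dummies)
```

Each original column contributes exactly one active indicator per row, so every row of `dummies` sums to the number of original columns.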