#**Data wrangling Exercise**

Data wrangling or data munging is the process of cleaning, transforming, and mapping data from one
form to another to utilize it for tasks such as analytics, summarization, reporting, visualization, and so on.

Data wrangling is one of the most important and involved steps in the whole Data Science workflow. The output
of this process directly impacts all downstream steps such as exploration, summarization, visualization,
analysis and even the final result. This clearly shows why Data Scientists spend a lot of time in Data
Collection and Wrangling.

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

In [1]:
# import required libraries
import numpy as np
import pandas as pd
from sklearn import preprocessing

from IPython.display import display # Display a Python object in all frontends

pd.options.mode.chained_assignment = None # ignoring the warning when working on slices of dataframes 


##Data wrangling utility functions

In [2]:
def describe_dataframe(df=pd.DataFrame()):
    """This function generates descriptive stats of a dataframe
    Args:
        df (dataframe): the dataframe to be analyzed
    Returns:
        None

    """
    print("\n\n")
    print("*"*30)
    print("About the Data")
    print("*"*30)
    
    print("Number of rows::",df.shape[0])
    print("Number of columns::",df.shape[1])
    print("\n")
    
    print("Column Names::",df.columns.values.tolist())
    print("\n")
    
    print("Column Data Types::\n",df.dtypes)
    print("\n")
    
    print("Columns with Missing Values::",df.columns[df.isnull().any()].tolist())
    print("\n")
    
    print("Number of rows with Missing Values::",df.isna().any(axis=1).sum())
    print("\n")
    
    print("Sample Indices with missing data::",df[df.isna().any(axis=1)].index[0:5])
    print("\n")
    
    print("General Stats::")
    df.info()  # info() prints its report directly and returns None
    print("\n")
    
    print("Summary Stats::")
    print(df.describe())
    print("\n")
    
    print("Dataframe Sample Rows::")
    display(df.head(5))
    
def cleanup_column_names(df,rename_dict=None,do_inplace=True):
    """This function renames columns of a pandas dataframe
       It converts column names to snake case if rename_dict is not passed. 
    Args:
        df (dataframe): the dataframe whose columns are renamed
        rename_dict (dict): keys represent old column names and values point to 
                            newer ones
        do_inplace (bool): flag to update existing dataframe or return a new one
    Returns:
        pandas dataframe if do_inplace is set to False, None otherwise

    """
    if not rename_dict:
        return df.rename(columns={col: col.lower().replace(' ','_').replace(r'/','_') 
                    for col in df.columns.values.tolist()}, 
                  inplace=do_inplace)
    else:
        return df.rename(columns=rename_dict,inplace=do_inplace)
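
The default renaming rule above can be tried on a small frame before applying it to real data. The following is a minimal sketch, assuming a hypothetical toy frame `toy` and a hypothetical helper `to_snake_case` that mirrors the rule (lower-case, then spaces and slashes replaced by underscores). It uses the non-inplace form of `rename`, so the original frame is left untouched.

```python
import pandas as pd

# Hypothetical toy frame; the column names mimic two headers of the wine data
toy = pd.DataFrame({"Malic acid": [1.78], "OD280/OD315 of diluted wines": [3.4]})

def to_snake_case(name):
    """Lower-case a column name and replace spaces and slashes with underscores."""
    return name.lower().replace(" ", "_").replace("/", "_")

# rename() accepts a callable for `columns` and returns a new frame by default
renamed = toy.rename(columns=to_snake_case)

print(renamed.columns.tolist())  # ['malic_acid', 'od280_od315_of_diluted_wines']
print(toy.columns.tolist())      # ['Malic acid', 'OD280/OD315 of diluted wines']
```

Note that `rename(..., inplace=True)` returns `None`, which is why `cleanup_column_names` only returns a dataframe when `do_inplace` is `False`.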

##Wine recognition dataset

This is the UCI ML Wine recognition dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The data are the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators. Thirteen measurements were taken for different constituents found in the three types of wine.

Original Owners:

Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.

Citation:

Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Dataset characteristics:
* Number of Instances: 178 (59, 71, and 48 in the three classes)
* Number of Attributes: 13 numeric, predictive attributes and the class
* Attribute info:
1. **Alcohol**: alcohol content, reported in units of ABV (alcohol by volume).

1. **Malic acid**: one of the principal organic acids found in wine. Although found in nearly every fruit and berry, its flavor is most prominent in green apples; likewise, it projects this sour flavor into wine. For more information, feel free to read about acids in wine.

1. **Ash**: yep, wine has ash in it. Ash is simply the inorganic matter left after evaporation and incineration.

1. **Alcalinity of ash**: the alkalinity of ash determines how basic (as opposed to acidic) the ash in a wine is.

1. **Magnesium**: magnesium is a metal that affects the flavor of wine.

1. **Total phenols**: Phenols are chemicals that affect the taste, color, and mouthfeel (i.e., texture) of wine. For some (very) in-depth information about phenols, we refer you to phenolic content in wine.

1. **Flavanoids**: flavonoids are a type of phenol.

1. **Nonflavanoid phenols**: nonflavonoids are another type of phenol.

1. **Proanthocyanins**: proanthocyanidins are yet another type of phenol.

1. **Color intensity**: the color intensity of a wine: i.e., how dark it is.

1. **Hue**: the hue of a wine, which is typically determined by the color of the cultivar used (although this is not always the case).

1. **OD280/OD315 of diluted wines**: protein content measurements.

1. **Proline**: an amino acid present in wines.
  
* Class
  * Class 0: 59
  * Class 1: 71
  * Class 2: 48

'messy_wine_data.csv' is a modified version of the 'Wine recognition dataset' in which some missing values have been introduced.

In [3]:
# Download 'messy_wine_data.csv'
!pip install wget
!python -m wget -o messy_wine_data.csv "https://raw.githubusercontent.com/udel-cbcb/al_ml_workshop/main/data/messy_wine_data.csv"
df = pd.read_csv('messy_wine_data.csv')
df.head()

# If you cloned the GitHub repository and are running the notebook locally
# df = pd.read_csv('../data/messy_wine_data.csv')
# df.head()


Collecting wget
  Using cached wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9673 sha256=d14e91459e1b166478b2674d4ef385ed0d61ad0eb6580bdceebe8088375653c8
  Stored in directory: /Users/chenc/Library/Caches/pip/wheels/bd/a8/c3/3cf2c14a1837a4e04bd98631724e81f33f462d86a1d895fae0
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
You should consider upgrading via the '/Users/chenc/miniconda3/bin/python -m pip install --upgrade pip' command.
100% [..........................................................] 12049 / 12049
Saved under messy_wine_data.csv


Unnamed: 0,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline,Class
0,14.23,,2.43,15.6,127.0,2.8,,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,,0.26,1.28,4.38,1.05,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,,2.8,2.69,0.39,-1.0,4.32,1.04,2.93,735.0,0


In [None]:
# describe the stats of dataframe
describe_dataframe(df)




******************************
About the Data
******************************
Number of rows:: 178
Number of columns:: 14


Column Names:: ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline', 'Class']


Column Data Types::
 Alcohol                         float64
Malic acid                      float64
Ash                             float64
Alcalinity of ash               float64
Magnesium                       float64
Total phenols                   float64
Flavanoids                      float64
Nonflavanoid phenols            float64
Proanthocyanins                 float64
Color intensity                 float64
Hue                             float64
OD280/OD315 of diluted wines    float64
Proline                         float64
Class                             int64
dtype: object


Columns with Missing Values:: ['Malic aci

Unnamed: 0,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline,Class
0,14.23,,2.43,15.6,127.0,2.8,,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,,0.26,1.28,4.38,1.05,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,,2.8,2.69,0.39,-1.0,4.32,1.04,2.93,735.0,0


In [None]:
print("Shape of df={}".format(df.shape))

Shape of df=(178, 14)


##Rename Columns

In [None]:
print("Dataframe columns:\n{}".format(df.columns.tolist()))

Dataframe columns:
['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline', 'Class']


In [None]:
cleanup_column_names(df)

In [None]:
print("Dataframe columns:\n{}".format(df.columns.tolist()))

Dataframe columns:
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280_od315_of_diluted_wines', 'proline', 'class']


##Sort Rows on defined attributes

In [None]:
df.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
0,14.23,,2.43,15.6,127.0,2.8,,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,,0.26,1.28,4.38,1.05,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,,2.8,2.69,0.39,-1.0,4.32,1.04,2.93,735.0,0


In [None]:
# Sort data by ascending malic_acid and decreasing ash
display(df.sort_values(['malic_acid', 'ash'], 
                         ascending=[True, False]).head())

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
113,11.41,0.74,2.5,21.0,88.0,2.48,2.01,0.42,1.44,3.08,1.1,2.31,434.0,1
84,11.84,0.89,2.58,18.0,94.0,2.2,2.21,0.22,2.35,3.05,0.79,3.08,520.0,1
76,-10.0,0.9,1.71,16.0,86.0,1.95,,0.24,1.46,4.6,1.19,2.48,392.0,1
80,-10.0,0.92,2.0,19.0,86.0,2.42,2.26,0.3,1.43,2.5,1.38,3.12,1.0,1
68,13.34,0.94,2.36,17.0,110.0,2.53,1.3,0.55,0.42,3.17,1.02,1.93,750.0,1
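
Because the dataset contains missing values, it helps to know where rows with a missing sort key end up. The sketch below uses a hypothetical toy frame with one missing `malic_acid` value; by default `sort_values` places such rows last, and `na_position="first"` moves them to the top.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing sort key and one tie in malic_acid
toy = pd.DataFrame({
    "malic_acid": [1.78, np.nan, 0.74, 1.78],
    "ash": [2.14, 2.43, 2.50, 2.67],
})

# Ascending malic_acid, ties broken by descending ash; NaN keys go last by default
sorted_last = toy.sort_values(["malic_acid", "ash"], ascending=[True, False])

# Same ordering, but rows with a missing sort key are placed first
sorted_first = toy.sort_values(["malic_acid", "ash"], ascending=[True, False],
                               na_position="first")

print(sorted_last.index.tolist())   # [2, 3, 0, 1]
print(sorted_first.index.tolist())  # [1, 2, 3, 0]
```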


In [None]:
# Sort data by decreasing alcohol
display(df.sort_values(['alcohol'], 
                         ascending=[False]).head())

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
8,14.83,,2.17,14.0,97.0,2.8,,0.29,1.98,5.2,1.08,2.85,1045.0,0
13,14.75,1.73,2.39,11.4,,3.1,3.69,0.43,2.81,5.4,1.25,2.73,1150.0,0
6,14.39,1.87,2.45,14.6,96.0,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290.0,0
46,14.38,3.59,2.28,16.0,102.0,3.25,3.17,0.27,2.19,4.9,1.04,3.44,1065.0,0
14,14.38,1.87,0.0,12.0,102.0,3.3,3.64,0.29,2.96,7.5,1.2,3.0,1547.0,0


##Rearrange Columns in a Dataframe

In [None]:
df.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
0,14.23,,2.43,15.6,127.0,2.8,,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,,0.26,1.28,4.38,1.05,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,,2.8,2.69,0.39,-1.0,4.32,1.04,2.93,735.0,0


In [None]:
# Rearrange columns in the order of 'class', 'alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash',
# 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
# 'color_intensity', 'hue', 'od280_od315_of_diluted_wines', 'proline'.
display(df[['class', 'alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash',
            'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
            'color_intensity', 'hue', 'od280_od315_of_diluted_wines', 'proline']].head())

Unnamed: 0,class,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline
0,0,14.23,,2.43,15.6,127.0,2.8,,0.28,2.29,5.64,1.04,3.92,1065.0
1,0,13.2,1.78,2.14,11.2,100.0,2.65,,0.26,1.28,4.38,1.05,3.4,1050.0
2,0,13.16,2.36,2.67,18.6,,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,0,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,0,13.24,2.59,2.87,21.0,,2.8,2.69,0.39,-1.0,4.32,1.04,2.93,735.0


In [None]:
# Rearrange columns in the order of 'alcohol', 'color_intensity', 'hue', 'malic_acid', 'ash', 'alcalinity_of_ash',
# 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
# 'od280_od315_of_diluted_wines', 'proline', 'class'.
display(df[['alcohol', 'color_intensity', 'hue', 'malic_acid', 'ash', 'alcalinity_of_ash',
            'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
            'od280_od315_of_diluted_wines', 'proline', 'class']].head())

Unnamed: 0,alcohol,color_intensity,hue,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,od280_od315_of_diluted_wines,proline,class
0,14.23,5.64,1.04,,2.43,15.6,127.0,2.8,,0.28,2.29,3.92,1065.0,0
1,13.2,4.38,1.05,1.78,2.14,11.2,100.0,2.65,,0.26,1.28,3.4,1050.0,0
2,13.16,5.68,1.03,2.36,2.67,18.6,,2.8,3.24,0.3,2.81,3.17,1185.0,0
3,14.37,7.8,0.86,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,3.45,1480.0,0
4,13.24,4.32,1.04,2.59,2.87,21.0,,2.8,2.69,0.39,-1.0,2.93,735.0,0
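
Typing every column name works, but the new order can also be built from the existing column list. The sketch below, on a hypothetical four-column toy frame, moves `class` to the front and keeps the remaining columns in their current order; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame with a subset of the wine columns
toy = pd.DataFrame({"alcohol": [14.23], "hue": [1.04], "proline": [1065.0], "class": [0]})

# Columns to move to the front; all others keep their relative order
front = ["class"]
new_order = front + [col for col in toy.columns if col not in front]

# Indexing with a list of names returns a new frame in that column order
reordered = toy[new_order]

print(reordered.columns.tolist())  # ['class', 'alcohol', 'hue', 'proline']
```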


##Filtering Columns

Using Column Index

In [None]:
# print 10 values from column at index 3
print(df.iloc[:,3].values[0:10])

[15.6 11.2 18.6 16.8 21.  15.2 14.6 17.6 14.  16. ]


Using Column Name

In [None]:
# print 10 values of total_phenols
print(df.total_phenols.values[0:10])

[2.8  2.65 2.8  3.85 2.8  3.27 2.5  2.6  2.8  2.98]


Using Column Datatype

In [None]:
# print the first 10 values of the first column with data type float64
print(df.select_dtypes(include=['float64']).values[:10,0])

[ 14.23  13.2   13.16  14.37  13.24  14.2   14.39  14.06  14.83 -10.  ]
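
`select_dtypes` returns a dataframe containing every column of the requested type, and the `[:10, 0]` slice above then picks only the first of those columns. A minimal sketch on a hypothetical toy frame makes the two steps visible.

```python
import pandas as pd

# Hypothetical toy frame: two float columns and one integer column
toy = pd.DataFrame({"alcohol": [14.23, 13.20], "hue": [1.04, 1.05], "class": [0, 1]})

# Keep only the float64 columns; the integer 'class' column is dropped
float_cols = toy.select_dtypes(include=["float64"])
print(float_cols.columns.tolist())  # ['alcohol', 'hue']

# Column 0 of the underlying array is the first float column, here 'alcohol'
first_float_values = float_cols.values[:, 0]
print(first_float_values.tolist())  # [14.23, 13.2]
```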


##Filtering Rows
Select specific rows

In [None]:
# Select the rows at positions 21, 45, and 100
display(df.iloc[[21, 45, 100]])

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
21,12.93,3.8,2.65,18.6,102.0,2.41,2.41,0.25,1.98,4.5,1.03,3.52,770.0,0
45,14.21,4.04,2.44,18.9,111.0,2.85,2.65,0.3,1.25,5.24,0.87,3.33,1080.0,0
100,12.08,2.08,1.7,17.5,97.0,2.23,2.17,0.26,1.4,3.3,1.27,2.96,710.0,1
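
With the default integer index, positions and labels coincide, so `iloc` and `loc` return the same rows. They diverge as soon as rows are dropped or reordered. The sketch below, using a hypothetical toy frame, shows the difference.

```python
import pandas as pd

# Hypothetical toy frame with the default index 0..3
toy = pd.DataFrame({"alcohol": [14.23, 13.20, 13.16, 14.37]})

# After dropping labels 0 and 2, the remaining index labels are 1 and 3
subset = toy.drop([0, 2])

by_position = subset.iloc[[1]]  # second remaining row, whatever its label
by_label = subset.loc[[1]]      # the row whose index label is 1

print(by_position.index.tolist())  # [3]
print(by_label.index.tolist())     # [1]
```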


Exclude Specific Row indices

In [None]:
# drop the first and third rows
display(df.drop([0, 2], axis=0).head())

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
1,13.2,1.78,2.14,11.2,100.0,2.65,,0.26,1.28,4.38,1.05,3.4,1050.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,,2.8,2.69,0.39,-1.0,4.32,1.04,2.93,735.0,0
5,14.2,1.76,2.45,15.2,112.0,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450.0,0
6,14.39,1.87,2.45,14.6,96.0,2.5,2.52,0.3,1.98,5.25,1.02,3.58,1290.0,0


Conditional Filtering

In [None]:
# Get those wines with ash > 2
display(df[df.ash > 2].head())

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
0,14.23,,2.43,15.6,127.0,2.8,,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,,0.26,1.28,4.38,1.05,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,,2.8,2.69,0.39,-1.0,4.32,1.04,2.93,735.0,0
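
Several conditions can be combined in one filter. As a minimal sketch on a hypothetical toy frame: each condition goes in its own parentheses, and `&` (and) or `|` (or) joins them, since Python's `and`/`or` keywords do not work element-wise on Series.

```python
import pandas as pd

# Hypothetical toy frame with two of the wine columns
toy = pd.DataFrame({"ash": [2.43, 1.70, 2.67, 2.50], "class": [0, 1, 0, 1]})

# Rows where ash > 2 AND the wine belongs to class 1
both = toy[(toy["ash"] > 2) & (toy["class"] == 1)]

# Rows where ash > 2.6 OR the wine belongs to class 1
either = toy[(toy["ash"] > 2.6) | (toy["class"] == 1)]

print(both.index.tolist())    # [3]
print(either.index.tolist())  # [1, 2, 3]
```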


Offset from top of the dataframe

In [None]:
# Skip the top 100 rows
display(df[100:].head())

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
100,12.08,2.08,1.7,17.5,97.0,2.23,2.17,0.26,1.4,3.3,1.27,2.96,710.0,1
101,12.6,1.34,1.9,18.5,88.0,1.45,1.36,0.29,1.35,2.45,1.04,2.77,562.0,1
102,12.34,2.45,2.46,21.0,98.0,2.56,2.11,0.34,1.31,2.8,0.8,3.38,438.0,1
103,11.82,1.72,1.88,19.5,86.0,2.5,1.64,0.37,1.42,2.06,0.94,2.44,415.0,1
104,12.51,1.73,1.98,20.5,85.0,2.2,1.92,0.32,1.48,2.94,1.04,3.57,672.0,1


Offset from bottom of the dataframe

In [None]:
# Keep only the last 10 rows (a negative start offset counts from the bottom); head() shows the first 5 of them
display(df[-10:].head())

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
168,13.58,2.58,2.69,24.5,105.0,1.55,0.84,0.39,1.54,8.66,0.74,1.8,750.0,2
169,13.4,4.6,2.86,25.0,112.0,1.98,0.96,0.27,1.11,8.5,0.67,1.92,630.0,2
170,12.2,3.03,2.32,19.0,96.0,1.25,0.49,0.4,0.73,5.5,0.66,1.83,510.0,2
171,12.77,2.39,2.28,19.5,86.0,1.39,0.51,0.48,0.64,0.0,0.57,1.63,470.0,2
172,14.16,2.51,2.48,20.0,91.0,1.68,0.7,0.44,1.24,9.7,0.62,1.71,660.0,2
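
The position of the negative number in the slice decides whether the bottom rows are kept or skipped. A short sketch on a hypothetical eight-row frame:

```python
import pandas as pd

# Hypothetical toy frame with index 0..7
toy = pd.DataFrame({"value": range(8)})

last_three = toy[-3:]          # negative start: keep only the last 3 rows
without_last_three = toy[:-3]  # negative stop: skip the last 3 rows

print(last_three.index.tolist())          # [5, 6, 7]
print(without_last_three.index.tolist())  # [0, 1, 2, 3, 4]
```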


##TypeCasting/Data Type Conversion

In [None]:
print("Old dtypes:\n", df.dtypes)
# change the data type of 'hue' from float to int (astype(int) drops the fractional part)
df['hue'] = df['hue'].astype(int)
# compare dtypes of the original df with this one
print("New dtypes:\n", df.dtypes)

Old dtypes:
 alcohol                         float64
malic_acid                      float64
ash                             float64
alcalinity_of_ash               float64
magnesium                       float64
total_phenols                   float64
flavanoids                      float64
nonflavanoid_phenols            float64
proanthocyanins                 float64
color_intensity                 float64
hue                             float64
od280_od315_of_diluted_wines    float64
proline                         float64
class                             int64
dtype: object
New dtypes:
 alcohol                         float64
malic_acid                      float64
ash                             float64
alcalinity_of_ash               float64
magnesium                       float64
total_phenols                   float64
flavanoids                      float64
nonflavanoid_phenols            float64
proanthocyanins                 float64
color_intensity                 float64
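
Casting floats with `astype(int)` truncates toward zero rather than rounding, which is why `hue` values such as 0.86 appear as 0 in later outputs. The sketch below uses a hypothetical Series of hue-like values to contrast truncation with rounding, and shows that the cast fails when the column still contains missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical hue-like values
hue = pd.Series([1.04, 0.86, 1.38, 0.57])

truncated = hue.astype(int)        # fractional part is dropped
rounded = hue.round().astype(int)  # round to nearest integer first, then cast

print(truncated.tolist())  # [1, 0, 1, 0]
print(rounded.tolist())    # [1, 1, 1, 1]

# A float column that contains NaN cannot be cast to a plain integer dtype
try:
    pd.Series([1.04, np.nan]).astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

print(cast_failed)  # True
```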


##Missing Values


In [None]:
# Drop rows with missing values in 'malic_acid' column
df_dropped = df.dropna(subset=['malic_acid'])
df_dropped.shape

(165, 14)

In [None]:
# Fill missing 'magnesium' values with the mean magnesium value (rounded to 2 decimals)
df_dropped['magnesium'] = df_dropped['magnesium'].fillna(np.round(df.magnesium.mean(), decimals=2))

In [None]:
# Fill missing 'flavanoids' values with the value from the previous row (forward fill)
df_dropped['flavanoids'] = df_dropped['flavanoids'].ffill()

In [None]:
# Fill any remaining missing 'flavanoids' values with the value from the next row (backward fill)
df_dropped['flavanoids'] = df_dropped['flavanoids'].bfill()
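
Forward fill cannot repair a missing value in the first row because there is no earlier value to copy, which is one reason to follow it with a backward fill. A minimal sketch on a hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical values with gaps at the start and in the middle
flavanoids = pd.Series([np.nan, 3.24, np.nan, 2.69])

forward = flavanoids.ffill()               # copies the previous value downwards
forward_then_back = forward.bfill()        # fills what is left from the next value

print(forward.isna().tolist())     # [True, False, False, False]
print(forward_then_back.tolist())  # [3.24, 3.24, 3.24, 2.69]
```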

##Duplicates


In [None]:
# Before dropping Duplicate 'alcohol' rows
display(df_dropped.head())
print("Shape of df before dropping duplicates ={}".format(df_dropped.shape))

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
1,13.2,1.78,2.14,11.2,100.0,2.65,3.24,0.26,1.28,4.38,1,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,99.26,2.8,3.24,0.3,2.81,5.68,1,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,99.26,2.8,2.69,0.39,-1.0,4.32,1,2.93,735.0,0
5,14.2,1.76,2.45,15.2,112.0,3.27,3.39,0.34,1.97,6.75,1,2.85,1450.0,0


Shape of df before dropping duplicates =(165, 14)


In [None]:
# After dropping Duplicate 'alcohol' rows
df_dropped.drop_duplicates(subset=['alcohol'],inplace=True)
# updated dataframe
display(df_dropped.head())
print("Shape of df after dropping duplicates ={}".format(df_dropped.shape))

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
1,13.2,1.78,2.14,11.2,100.0,2.65,3.24,0.26,1.28,4.38,1,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,99.26,2.8,3.24,0.3,2.81,5.68,1,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,99.26,2.8,2.69,0.39,-1.0,4.32,1,2.93,735.0,0
5,14.2,1.76,2.45,15.2,112.0,3.27,3.39,0.34,1.97,6.75,1,2.85,1450.0,0


Shape of df after dropping duplicates =(113, 14)
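
Deduplicating on a single measurement column removes every row that shares a value with an earlier row, even when the other measurements differ, so it is worth checking which rows survive. The sketch below uses a hypothetical toy frame to show the effect of the `keep` parameter of `drop_duplicates`.

```python
import pandas as pd

# Hypothetical toy frame: two pairs of rows share an 'alcohol' value
toy = pd.DataFrame({
    "alcohol": [13.20, 14.38, 13.20, 14.38],
    "ash": [2.14, 2.28, 2.50, 0.00],
})

keep_first = toy.drop_duplicates(subset=["alcohol"])               # default keep="first"
keep_last = toy.drop_duplicates(subset=["alcohol"], keep="last")   # keep the last occurrence
keep_none = toy.drop_duplicates(subset=["alcohol"], keep=False)    # drop every duplicated row

print(keep_first.index.tolist())  # [0, 1]
print(keep_last.index.tolist())   # [2, 3]
print(len(keep_none))             # 0
```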


##Encode Categoricals


In [None]:
# Get One Hot Encoding using get_dummies() for 'class'
display(pd.get_dummies(df,columns=['class']).head())

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class_0,class_1,class_2
0,14.23,,2.43,15.6,127.0,2.8,,0.28,2.29,5.64,1,3.92,1065.0,1,0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,,0.26,1.28,4.38,1,3.4,1050.0,1,0,0
2,13.16,2.36,2.67,18.6,,2.8,3.24,0.3,2.81,5.68,1,3.17,1185.0,1,0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0,3.45,1480.0,1,0,0
4,13.24,2.59,2.87,21.0,,2.8,2.69,0.39,-1.0,4.32,1,2.93,735.0,1,0,0
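
The output above shows the indicator columns as 0/1 integers. Recent pandas releases (2.0 and later) return boolean indicator columns by default, so passing `dtype=int` is one way to get the integer form regardless of version. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame with one wine from each class
toy = pd.DataFrame({"alcohol": [14.23, 12.08, 13.58], "class": [0, 1, 2]})

# One indicator column per distinct class value, forced to integer dtype
encoded = pd.get_dummies(toy, columns=["class"], dtype=int)

print(encoded.columns.tolist())     # ['alcohol', 'class_0', 'class_1', 'class_2']
print(encoded["class_1"].tolist())  # [0, 1, 0]
```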


##Random Sampling data from DataFrame

In [None]:
# Randomly sample 30% of the rows, with replacement (a row can be drawn more than once)
display(df.sample(frac=0.3, replace=True, random_state=42).head())

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
102,12.34,2.45,2.46,21.0,98.0,2.56,2.11,0.34,1.31,2.8,0,3.38,438.0,1
92,12.69,1.53,2.26,20.7,80.0,1.38,1.46,0.58,1.62,3.05,0,2.06,495.0,1
14,14.38,1.87,0.0,12.0,102.0,3.3,3.64,0.29,2.96,7.5,1,3.0,1547.0,0
106,12.25,1.73,2.12,19.0,80.0,1.65,2.03,0.37,1.63,3.4,1,3.17,510.0,1
71,13.86,1.51,2.67,25.0,,2.95,2.86,0.21,1.87,3.38,1,3.16,410.0,1
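
The `replace` flag controls whether the same row may be drawn more than once. The sketch below, on a hypothetical ten-row frame, draws 30% of the rows both ways; without replacement the sampled index is guaranteed to be unique, while with replacement repeats are possible.

```python
import pandas as pd

# Hypothetical toy frame with index 0..9
toy = pd.DataFrame({"value": range(10)})

# 30% of 10 rows gives 3 rows in both cases
without_replacement = toy.sample(frac=0.3, random_state=42)
with_replacement = toy.sample(frac=0.3, replace=True, random_state=42)

print(len(without_replacement), len(with_replacement))  # 3 3
print(without_replacement.index.is_unique)              # True
```

Fixing `random_state` makes the draw reproducible across runs of the notebook.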


##Normalizing Numeric Values
Normalize 'alcohol' values using **Min-Max Scaler**

In [None]:
# Normalize 'alcohol' values using Min-Max Scaler
df_normalized = df.dropna().copy()
# Create a min_max_scaler
min_max_scaler = preprocessing.MinMaxScaler()
# Transform data, reshape your data using array.reshape(-1, 1) if your data has a single feature
alcohol_scaled = min_max_scaler.fit_transform(df_normalized['alcohol'].values.reshape(-1,1))
df_normalized['alcohol'] = alcohol_scaled.ravel()  # flatten the (n, 1) result back to 1-D

In [None]:
display(df_normalized.head())

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
3,0.99918,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0,3.45,1480.0,0
5,0.99221,1.76,2.45,15.2,112.0,3.27,3.39,0.34,1.97,6.75,1,2.85,1450.0,0
6,1.0,1.87,2.45,14.6,96.0,2.5,2.52,0.3,1.98,5.25,1,3.58,1290.0,0
11,0.98893,1.48,0.0,16.8,95.0,2.2,2.43,0.26,1.57,5.0,1,2.82,1280.0,0
12,0.97376,1.73,2.41,16.0,89.0,2.6,2.76,0.29,1.81,5.6,1,2.9,1320.0,0
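
With its default settings, `MinMaxScaler` maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below reproduces that formula with plain pandas on hypothetical values; it illustrates the arithmetic and is not a replacement for the scaler.

```python
import pandas as pd

# Hypothetical alcohol-like values
alcohol = pd.Series([12.0, 13.0, 14.0, 16.0])

# Min-max scaling: shift by the minimum, divide by the range
scaled = (alcohol - alcohol.min()) / (alcohol.max() - alcohol.min())

print(scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

Because the formula depends on the minimum and maximum, a single extreme value stretches the range. The sorted output earlier shows `alcohol` values of -10.0; if such rows remain after `dropna()`, genuine values are squeezed toward 1, which would be consistent with the 0.97 to 1.0 values displayed above.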


Normalize 'magnesium' values using **Robust Scaler**

In [None]:
# Normalize 'magnesium' values using Robust Scaler
df_normalized = df.dropna().copy()
# Create a RobustScaler
robust_scaler = preprocessing.RobustScaler()
magnesium_scaled = robust_scaler.fit_transform(df_normalized['magnesium'].values.reshape(-1,1))
df_normalized['magnesium'] = magnesium_scaled.ravel()  # flatten the (n, 1) result back to 1-D

In [None]:
display(df_normalized.head())

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,class
3,14.37,1.95,2.5,16.8,0.842105,3.85,3.49,0.24,2.18,7.8,0,3.45,1480.0,0
5,14.2,1.76,2.45,15.2,0.789474,3.27,3.39,0.34,1.97,6.75,1,2.85,1450.0,0
6,14.39,1.87,2.45,14.6,-0.052632,2.5,2.52,0.3,1.98,5.25,1,3.58,1290.0,0
11,14.12,1.48,0.0,16.8,-0.105263,2.2,2.43,0.26,1.57,5.0,1,2.82,1280.0,0
12,13.75,1.73,2.41,16.0,-0.421053,2.6,2.76,0.29,1.81,5.6,1,2.9,1320.0,0
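
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range (75th minus 25th percentile), so a few extreme values have little influence on the scale. The sketch below reproduces that arithmetic with plain pandas on hypothetical values that include one outlier.

```python
import pandas as pd

# Hypothetical magnesium-like values; 200.0 plays the role of an outlier
magnesium = pd.Series([80.0, 90.0, 100.0, 110.0, 200.0])

median = magnesium.median()                                # 100.0
iqr = magnesium.quantile(0.75) - magnesium.quantile(0.25)  # 110.0 - 90.0 = 20.0

# Robust scaling: centre on the median, scale by the interquartile range
scaled = (magnesium - median) / iqr

print(scaled.tolist())  # [-1.0, -0.5, 0.0, 0.5, 5.0]
```

The outlier still maps to a large value (5.0), but it does not compress the other values the way it would under min-max scaling.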


##Data Summarization
Condition based aggregation

In [None]:
# Get the mean 'hue' of class 1 wine
mean_hue = df['hue'][df['class']==1].mean()
print("Mean 'hue' of class 1 wine :: {}".format(mean_hue))

Mean 'hue' of class 1 wine :: 0.5633802816901409


In [None]:
# Get the max 'alcohol' of class 0 wine
max_alcohol = df['alcohol'][df['class']==0].max()
print("Max 'alcohol' of class 0 wine :: {}".format(max_alcohol))

Max 'alcohol' of class 0 wine :: 14.83
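
The two condition-based aggregations above each handle one class at a time. As a sketch of an alternative, assuming a hypothetical toy frame and illustrative output column names (`max_alcohol`, `mean_hue`), `groupby` with named aggregation computes the same statistics for every class in one call.

```python
import pandas as pd

# Hypothetical toy frame with a few wines per class
toy = pd.DataFrame({
    "class": [0, 0, 1, 1, 2],
    "alcohol": [14.23, 13.20, 12.08, 12.60, 13.58],
    "hue": [1.04, 1.05, 1.27, 1.04, 0.74],
})

# One row per class; each keyword names an output column as (source column, function)
summary = toy.groupby("class").agg(
    max_alcohol=("alcohol", "max"),
    mean_hue=("hue", "mean"),
)

print(summary)
```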
