# Machine Learning - Data Analysis Introduction

# Part 1

## Get your Python environment running


Familiarize yourself with the most important packages in machine learning. Others may follow later, but these are the ones you will need every time:
- numpy
- pandas
- matplotlib
- scikit-learn

Make sure you can import all of them in your notebook.
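
One way to check the environment without triggering an import error is to ask Python whether each package can be found. The sketch below is only an illustration: the helper name `check_packages` is hypothetical, and it assumes scikit-learn is installed under its usual import name `sklearn`.

```python
import importlib.util

# Import names of the four packages listed above
# (scikit-learn is imported under the name "sklearn").
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn"]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(REQUIRED)
for name, available in status.items():
    print(f"{name}: {'ok' if available else 'MISSING'}")
```

Any package reported as missing needs to be installed before continuing.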

## Warm-up Exercise
- Write a function that samples N data points in the p-dimensional unit cube $[0,1]^p$ and sorts them by their squared distance to the origin, divided by the squared distance of the corner farthest from the origin.

- Execute the function for N=10000 and $p \in \{1, 2, \dots, 20\}$.
- Plot the minimum distance as a function of p.
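
A possible starting point for the warm-up, under a few assumptions: points are drawn uniformly, the corner farthest from the origin is $(1, \dots, 1)$ with squared distance $p$, and the function name `sample_normalized_sq_distances` and the fixed seed are illustrative choices rather than part of the exercise.

```python
import numpy as np


def sample_normalized_sq_distances(n, p, rng):
    """Sample n points uniformly in [0, 1]^p and return their sorted
    squared distances to the origin, divided by p."""
    points = rng.random((n, p))           # shape (n, p), uniform in [0, 1)
    sq_dist = np.sum(points**2, axis=1)   # squared Euclidean distance to the origin
    # The farthest corner is (1, ..., 1), whose squared distance is p.
    return np.sort(sq_dist / p)


rng = np.random.default_rng(seed=0)       # fixed seed: an assumption, for reproducibility
dims = np.arange(1, 21)

# After sorting, the first entry is the minimum normalized distance for that p.
min_dist = np.array(
    [sample_normalized_sq_distances(10_000, p, rng)[0] for p in dims]
)
```

Passing `dims` and `min_dist` to `matplotlib.pyplot.plot` then gives the requested plot.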

In [2]:
import numpy as np 
import pandas as pd

## Data Processing

The most useful Python library for data science is pandas: https://pandas.pydata.org/

It provides two basic data structures, Series and DataFrame.
- Series: one-dimensional labeled array of any data type (a single data column with a header)
- DataFrame: two-dimensional data structure with multiple labeled columns (column titles)
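
To make the two structures concrete, here is a minimal sketch. The values and column names are made up for illustration and are not loaded from any dataset.

```python
import pandas as pd

# A Series: one labeled column of values (the values here are made up).
lengths = pd.Series([5.1, 4.9, 4.7], name="sepal length")

# A DataFrame: several columns sharing one row index.
df = pd.DataFrame({
    "sepal length": [5.1, 4.9, 4.7],
    "sepal width": [3.5, 3.0, 3.2],
})

# Selecting one column of a DataFrame returns a Series.
column = df["sepal length"]
print(type(column).__name__)   # Series
print(df.shape)                # (3, 2)
```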

Pandas makes it easy to prepare and explore data for state-of-the-art machine learning.
We typically use it for 
- descriptive statistics of the data 
- data visualization, in connection with plotting libraries like matplotlib or seaborn (pandas also has some integrated visualization capabilities)
- data transformation 
- combination, split of datasets 
- ...


__We will need these commands and skills throughout the lecture, so make sure that you familiarize yourself with the pandas library.__

To get fluent with pandas, carry out the following __exercises__. 
- Use the documentation and API reference of pandas to learn the basics about these functions.
- These exercises guide you through a set of standard data science tasks.



# Data loading and easy transformations
Load the following two datasets into a dataframe 
- iris.csv to dataframe named iris_df
- decision-tree.txt to dataframe named tree_df


In [7]:
# Forward slashes avoid the invalid "\i" and "\d" escape sequences and work on all platforms
iris_data = pd.read_csv('Datasets-20241017/iris.csv')
tree_data = pd.read_table('Datasets-20241017/decision-tree.txt')
print('Both the files have been loaded')

Both the files have been loaded




For iris_df do the following
- Look at the dataframe.
- Get the column names.
- Rename the columns such that all names are written in UPPERCASE.
- Get a simple statistic for the data. 
- Generate a one-hot encoding for the class values

## Data joins 

Load the individual datasets iris1, iris2, iris3, iris4, iris5. Join them appropriately into one dataframe taking into account column names and indices.

Check that your joined dataframe corresponds to iris_df.
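
How the five files have to be combined depends on how they were split, which is not stated here. The sketch below therefore uses a small made-up frame (`full`, a hypothetical stand-in for iris_df) to show the two common cases with `pd.concat`, and how `DataFrame.equals` can confirm the result.

```python
import pandas as pd

# Hypothetical stand-in for iris_df (values are made up for illustration).
full = pd.DataFrame({
    "sepal length": [5.1, 4.9, 4.7, 4.6],
    "sepal width": [3.5, 3.0, 3.2, 3.1],
    "class": ["a", "a", "b", "b"],
})

# Case 1: pieces hold different rows but the same columns -> stack rows.
rows_a, rows_b = full.iloc[:2], full.iloc[2:]
by_rows = pd.concat([rows_a, rows_b], axis=0)

# Case 2: pieces hold different columns for the same rows -> align on the index.
cols_a, cols_b = full[["sepal length"]], full[["sepal width", "class"]]
by_cols = pd.concat([cols_a, cols_b], axis=1)

# DataFrame.equals checks that shape, elements, and dtypes match.
print(by_rows.equals(full), by_cols.equals(full))
```

If the pieces overlap or arrive in a different order, the index and column names have to be inspected first, for example by sorting the index or dropping duplicate rows before comparing.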

In [8]:
iris_data.head()

   sepal length  sepal width  petal length  petal width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [9]:
iris_data.columns

Index(['sepal length', 'sepal width', 'petal length', 'petal width', 'class'], dtype='object')

In [11]:
iris_data.columns = iris_data.columns.str.upper()
print(iris_data.head())

   SEPAL LENGTH  SEPAL WIDTH  PETAL LENGTH  PETAL WIDTH        CLASS
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [14]:
iris_data.describe()


       SEPAL LENGTH  SEPAL WIDTH  PETAL LENGTH  PETAL WIDTH
count         150.0        150.0         150.0        150.0
mean       5.843333        3.054      3.758667     1.198667
std        0.828066     0.433594       1.76442     0.763161
min             4.3          2.0           1.0          0.1
25%             5.1          2.8           1.6          0.3
50%             5.8          3.0          4.35          1.3
75%             6.4          3.3           5.1          1.8
max             7.9          4.4           6.9          2.5


In [17]:
pd.get_dummies(iris_data, columns=['CLASS'])

     SEPAL LENGTH  SEPAL WIDTH  PETAL LENGTH  PETAL WIDTH  CLASS_Iris-setosa  CLASS_Iris-versicolor  CLASS_Iris-virginica
0             5.1          3.5           1.4          0.2               True                  False                 False
1             4.9          3.0           1.4          0.2               True                  False                 False
2             4.7          3.2           1.3          0.2               True                  False                 False
3             4.6          3.1           1.5          0.2               True                  False                 False
4             5.0          3.6           1.4          0.2               True                  False                 False
...           ...          ...           ...          ...                ...                    ...                   ...
145           6.7          3.0           5.2          2.3              False                  False                  True
146           6.3          2.5           5.0          1.9              False                  False                  True
147           6.5          3.0           5.2          2.0              False                  False                  True
148           6.2          3.4           5.4          2.3              False                  False                  True


# Part 2

## Elementary data analysis and visualization of the Iris dataset

How is each of the quantities sepal length, sepal width, petal length, and petal width distributed?
- Compute statistical quantities like mean, standard deviation.
- Are there any values far from the average?
- Visualize the data distribution by appropriate histogram plots. 
    - Use matplotlib.pyplot.hist()
    - Familiarize yourself with the hist() method and its parameters
    - Try at least the following two strategies for the bins parameter: define appropriate binning yourself and at least one pre-defined strategy (e.g. 'auto'). 
    - See the documentation of matplotlib to understand how hist() works. 
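
The two binning strategies can be compared side by side. The following is a rough sketch on synthetic data (normally distributed values, not the real iris measurements). The bin width of 0.5 and the `Agg` backend are assumptions made so the example runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for one measurement column (not the real iris values).
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=5.8, scale=0.8, size=150)

fig, (ax_manual, ax_auto) = plt.subplots(1, 2, figsize=(9, 3))

# Strategy 1: explicit bin edges of width 0.5 covering the data range.
edges = np.arange(np.floor(values.min()), np.ceil(values.max()) + 0.5, 0.5)
counts_manual, _, _ = ax_manual.hist(values, bins=edges)
ax_manual.set_title("bins = manual edges")

# Strategy 2: let NumPy choose the binning via the 'auto' rule.
counts_auto, edges_auto, _ = ax_auto.hist(values, bins="auto")
ax_auto.set_title("bins = 'auto'")

for ax in (ax_manual, ax_auto):
    ax.set_xlabel("value")
    ax.set_ylabel("count")
fig.tight_layout()
```

With the real data, `values` would be replaced by a column such as `iris_data['SEPAL LENGTH']`, and the manual bin width chosen to match the measurement resolution.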



In [27]:
# Mean (double-quoted f-strings so the single-quoted column names also parse on Python < 3.12)
print(f"The mean of sepal length: {iris_data['SEPAL LENGTH'].mean():.2f}")
print(f"The mean of sepal width: {iris_data['SEPAL WIDTH'].mean():.2f}")
print(f"The mean of petal length: {iris_data['PETAL LENGTH'].mean():.2f}")
print(f"The mean of petal width: {iris_data['PETAL WIDTH'].mean():.2f}")
print()

# Standard deviation
print(f"The standard deviation of sepal length: {iris_data['SEPAL LENGTH'].std():.2f}")
print(f"The standard deviation of sepal width: {iris_data['SEPAL WIDTH'].std():.2f}")
print(f"The standard deviation of petal length: {iris_data['PETAL LENGTH'].std():.2f}")
print(f"The standard deviation of petal width: {iris_data['PETAL WIDTH'].std():.2f}")


The mean of sepal length: 5.84
The mean of sepal width: 3.05
The mean of petal length: 3.76
The mean of petal width: 1.20

The standard deviation of sepal length: 0.83
The standard deviation of sepal width: 0.43
The standard deviation of petal length: 1.76
The standard deviation of petal width: 0.76


In [None]:
# Values far from the mean
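
One simple way to approach this open cell is to flag values that lie more than k standard deviations from the mean. This is only a sketch: the helper `far_from_mean`, the threshold `k=2.0`, and the demo values are assumptions for illustration, not part of the exercise or the iris data.

```python
import pandas as pd


def far_from_mean(series, k=2.0):
    """Return the entries of `series` more than k standard deviations from its mean."""
    z = (series - series.mean()) / series.std()
    return series[z.abs() > k]


# Made-up values with one obvious outlier (not taken from the iris data).
demo = pd.Series([5.0, 5.1, 4.9, 5.2, 5.0, 4.8, 5.1, 9.0], name="demo")
print(far_from_mean(demo))
```

Applied to the notebook's data, the same call would be `far_from_mean(iris_data['SEPAL WIDTH'])`, repeated for each measurement column. The choice of k controls how strict the definition of "far" is.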
