# Machine Learning - Exercise 2
The aim of the exercise is to get an overview of the basic capabilities of the Pandas, Matplotlib and Seaborn libraries for data analysis, visualization and data transformation.

**Jupyter lab**

* Add code
* Add text
* Execute command
* Shortcuts (a, b, dd, Ctrl+Enter, Shift+Enter, x, c, v)

**Alternatives**

* Google Colab ([Colaboratory](https://colab.research.google.com/))
* Python scripts in VS Code

![meme01](https://github.com/lowoncuties/VSB-FEI-Machine-Learning-Exercises/blob/main/images/fml_01_meme_01.png?raw=true)

## Import used packages

In [None]:
import pandas as pd # dataframes
import numpy as np # matrices and linear algebra
import matplotlib.pyplot as plt # plotting
import seaborn as sns # another matplotlib interface - styled and easier to use
import sklearn.preprocessing # library for data preprocessing

## Load data into Pandas DataFrame

In [None]:
df_full = pd.read_csv('datasets/ml_02/global_house_purchase_dataset.csv', sep=',')
df_full

## Show first 5 rows

## Show last 20 rows
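
A minimal sketch of both previews, assuming a small made-up frame named `df_demo` in place of `df_full` (the values are invented for illustration). `head()` returns 5 rows by default, and both methods accept a row count.

```python
import pandas as pd

# Toy stand-in for df_full (made-up values); 30 rows so the two previews differ
df_demo = pd.DataFrame({"price": range(100_000, 130_000, 1_000)})

first_rows = df_demo.head()    # head() defaults to the first 5 rows
last_rows = df_demo.tail(20)   # pass a number to change the row count

print(first_rows)
print(last_rows)
```

With the real data the same calls would be `df_full.head()` and `df_full.tail(20)`.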

## If we want to know whether there are any missing values, the isna() function may prove useful
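
One possible way to use `isna()`, shown on a tiny made-up frame (`df_demo` and its values are assumptions, only the column names come from the exercise). `isna()` returns a boolean mask, so summing it counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column (made-up values)
df_demo = pd.DataFrame({
    "price": [120_000, np.nan, 180_000],
    "rooms": [4, 7, np.nan],
    "property_type": ["Apartment", "Villa", None],
})

missing_per_column = df_demo.isna().sum()   # number of missing values per column
any_missing = df_demo.isna().any().any()    # True if anything is missing at all

print(missing_per_column)
print(any_missing)
```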

## We can show a summary of common statistical characteristics of the data using the describe() function
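
A short sketch of `describe()` on made-up data (`df_demo` is an illustrative stand-in). By default only numerical columns are summarized; passing `include="all"` adds the non-numerical ones.

```python
import pandas as pd

# Toy data (made-up values)
df_demo = pd.DataFrame({
    "price": [100_000, 200_000, 300_000, 400_000],
    "property_type": ["Apartment", "Villa", "Villa", "Farmhouse"],
})

summary = df_demo.describe()                   # count, mean, std, min, quartiles, max
summary_all = df_demo.describe(include="all")  # also covers non-numerical columns

print(summary)
print(summary_all)
```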

## 💡 Dataframe has several useful properties
    - shape
    - dtypes
    - columns
    - index

### Row and column count

### Datatypes of columns

### Column names

### Row index values
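
The four properties listed above can be sketched on a small made-up frame (`df_demo` is an assumption, not the exercise data). Note that they are attributes, so they are read without parentheses.

```python
import pandas as pd

# Toy data (made-up values)
df_demo = pd.DataFrame({
    "price": [120_000, 250_000, 180_000],
    "property_type": ["Apartment", "Villa", "Farmhouse"],
})

n_rows, n_cols = df_demo.shape   # (row count, column count)
column_types = df_demo.dtypes    # Series mapping column name to dtype
column_names = df_demo.columns   # Index with the column labels
row_labels = df_demo.index       # Index with the row labels (0..n-1 by default)

print(n_rows, n_cols)
print(column_types)
print(list(column_names))
print(list(row_labels))
```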

## We may want to work with just one column not the whole dataframe
- We will select only the price column and save it to another variable

## A single column is a Pandas Series, which shares a common API with the Pandas DataFrame
- 💡 Pandas is numpy-backed, so we can access a Series as a standard numpy array through the .values property

## Find maximum price using Numpy and Pandas
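
A minimal sketch of selecting one column and taking its maximum in both ways, assuming a made-up `df_demo` frame with a `price` column.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values)
df_demo = pd.DataFrame({"price": [120_000, 250_000, 180_000, 310_000, 90_000]})

price = df_demo["price"]           # selecting one column returns a pandas Series
price_array = price.values         # the underlying numpy array

max_numpy = np.max(price_array)    # numpy function applied to the raw array
max_pandas = price.max()           # equivalent Series method

print(max_numpy, max_pandas)
```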

## Data filtering using Pandas DataFrame
- There are several ways of filtering the data (similar logic to .Where() in C# or WHERE in SQL)
- 💡 We usually work with two indexers - .loc[] and .iloc[]

### The .iloc[] indexer works with positional indexes - very close to the way of working with the raw arrays
### The .loc[] indexer works with column names and logical expressions

### Select all rows and 3rd column of dataframe

### Select all rows and LAST column of dataframe

### Select 4th to 10th row and all columns

### Select 2nd to 13th row and 3rd column
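
The positional selections above might look like the sketch below. It uses a made-up 15 x 4 frame with hypothetical column names `a` to `d`. Positions are zero-based and the end of a slice is exclusive, so the "4th to 10th row" is `3:10`.

```python
import numpy as np
import pandas as pd

# Toy frame: 15 rows x 4 columns of consecutive integers (made-up data)
df_demo = pd.DataFrame(np.arange(60).reshape(15, 4), columns=["a", "b", "c", "d"])

third_col = df_demo.iloc[:, 2]                  # all rows, 3rd column
last_col = df_demo.iloc[:, -1]                  # all rows, last column
rows_4_to_10 = df_demo.iloc[3:10]               # 4th to 10th row, all columns
rows_2_to_13_third_col = df_demo.iloc[1:13, 2]  # 2nd to 13th row, 3rd column

print(rows_4_to_10)
print(rows_2_to_13_third_col)
```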

## Select only a subset of columns to a new dataframe
> ['Id', 'SalePrice','MSSubClass','BldgType','HouseStyle','OverallQual','OverallCond','YearBuilt','Heating','CentralAir','GrLivArea','BedroomAbvGr']

### Select only houses built in year 2000 or later
* constructed_year

### Select only houses that are not Farmhouse (try != and ~ operators)
* property_type

### Select houses that are cheaper than 200k USD and have at least 6 rooms
* price, rooms

### Select houses that are (cheaper than 150K and built after 1990) or (have monthly expenses under 10K and are in France)
* price, constructed_year, monthly_expenses
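
The filters above can be sketched with boolean masks inside `.loc[]`. The frame below is made up, and the column name `country` is a hypothetical placeholder because the exercise does not name the column that holds the country. Each condition needs its own parentheses because `&`, `|` and `~` bind more tightly than comparisons.

```python
import pandas as pd

# Toy data (made-up values); "country" is an assumed column name
df_demo = pd.DataFrame({
    "price": [120_000, 250_000, 180_000, 90_000, 310_000, 145_000],
    "rooms": [4, 7, 6, 3, 8, 6],
    "constructed_year": [1985, 2005, 1999, 1972, 2015, 2001],
    "property_type": ["Apartment", "Villa", "Farmhouse", "Apartment", "Villa", "Farmhouse"],
    "monthly_expenses": [8_000, 15_000, 9_000, 12_000, 20_000, 11_000],
    "country": ["France", "Germany", "France", "Italy", "France", "Germany"],
})

built_2000_plus = df_demo.loc[df_demo["constructed_year"] >= 2000]

# Two equivalent ways to exclude a category
not_farmhouse_ne = df_demo.loc[df_demo["property_type"] != "Farmhouse"]
not_farmhouse_inv = df_demo.loc[~(df_demo["property_type"] == "Farmhouse")]

cheap_and_roomy = df_demo.loc[(df_demo["price"] < 200_000) & (df_demo["rooms"] >= 6)]

# (cheap AND recent) OR (low expenses AND in France)
combined = df_demo.loc[
    ((df_demo["price"] < 150_000) & (df_demo["constructed_year"] > 1990))
    | ((df_demo["monthly_expenses"] < 10_000) & (df_demo["country"] == "France"))
]

print(combined)
```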

# We can add new columns to the DataFrame as well

![meme02](https://github.com/lowoncuties/VSB-FEI-Machine-Learning-Exercises/blob/main/images/fml_01_meme_02.png?raw=true)

### Add a new column named price_per_sqft_feet
* price, property_size_sqft
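
A minimal sketch of adding a derived column, assuming a made-up `df_demo` frame with the two columns named above. Arithmetic between two Series is applied element-wise, row by row.

```python
import pandas as pd

# Toy data (made-up values)
df_demo = pd.DataFrame({
    "price": [100_000, 300_000],
    "property_size_sqft": [1_000, 1_500],
})

# Assigning to a new column name creates the column
df_demo["price_per_sqft_feet"] = df_demo["price"] / df_demo["property_size_sqft"]

print(df_demo)
```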

## Pandas enables us to use aggregation functions for the data using the .groupby() function

### Compute counts for all the property types (groupby / value_counts)
* property_type
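
Both counting approaches can be sketched on made-up data as follows (`df_demo` and its values are assumptions). The last line shows that the same grouping also works with other aggregations, here the mean price per type.

```python
import pandas as pd

# Toy data (made-up values)
df_demo = pd.DataFrame({
    "property_type": ["Apartment", "Villa", "Apartment", "Farmhouse", "Apartment", "Villa"],
    "price": [100_000, 300_000, 120_000, 200_000, 140_000, 500_000],
})

counts_vc = df_demo["property_type"].value_counts()     # sorted by count, descending
counts_gb = df_demo.groupby("property_type").size()     # sorted by group label
mean_price = df_demo.groupby("property_type")["price"].mean()

print(counts_vc)
print(counts_gb)
print(mean_price)
```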

# Visualization

### Don't switch to scientific notation for plots

In [None]:
plt.rcParams['axes.formatter.useoffset'] = False   # remove the +1e6 offset text
plt.rcParams['axes.formatter.limits'] = (-100, 100)  # never switch to sci notation


## Plots and how not to get lost among them
Here’s a guide to help you decide which plot to choose for different situations

### Histogram
- **Purpose**: Displays the distribution of a single continuous variable by dividing the data into bins and showing how many values fall within each bin.
- **Use it when**: You want to understand the distribution or shape of a dataset, or to identify multiple peaks (for example, occurrences clustered at several points in time).
- **Example**: Understanding how exam scores are distributed in a class.
- **In-depth explanation** [Histogram](https://labxchange.org/library/items/lb:LabXchange:10d3270e:html:1)
  
    ![Histogram explained](https://github.com/lowoncuties/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/blob/master/images/histogram_explained.png?raw=true)

### Boxplot
- **Purpose**: Visualizes the summary statistics (e.g., median, quartiles) of a continuous variable, highlighting the spread of the data and potential outliers.
- **Use it when**: You need to compare distributions across categories or identify outliers and the overall spread of the data.
- **Example**: Comparing income distributions across different job titles.
- **In-depth explanation** [Boxplot](https://www.simplypsychology.org/boxplots.html)
 
    ![Boxplot explained](https://github.com/lowoncuties/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/blob/master/images/boxplot_explained.png?raw=true)
    
### Barplot
- **Purpose**: Displays categorical data with bars representing the frequency or value of each category.
- **Use it when**: You want to compare the size of different categories.
- **Example**: Showing the number of students in each grade category (A, B, C).
- **In-depth explanation** [Barplot](https://www.labxchange.org/library/items/lb:LabXchange:e034541a:html:1)
  
  ![Barplot explained](https://github.com/lowoncuties/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/blob/master/images/barplot_explained.png?raw=true)
### Scatterplot
- **Purpose**: Shows the relationship between two continuous variables, with each point representing an observation.
- **Use it when**: You want to explore potential correlations or patterns between two variables.
- **Example**: Exploring the relationship between hours studied and exam scores.
- **In-depth explanation** [Scatterplot](https://www.atlassian.com/data/charts/what-is-a-scatter-plot)
  
    ![Scatterplot explained](https://github.com/lowoncuties/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/blob/master/images/scatterplot_explained.png?raw=true)
  
### Line Plot
- **Purpose**: Connects data points with a line to show trends over time or other ordered variables.
- **Use it when**: You want to track changes or trends, especially over time.
- **Example**: Visualizing the monthly sales of a product over a year.
- **In-depth explanation** [Line plot/graph](https://www.atlassian.com/data/charts/line-chart-complete-guide)

    ![Line plot/graph explained](https://github.com/lowoncuties/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/blob/master/images/lineplot_explained.png?raw=true)

## Visualize house prices in the form of a histogram
- What does it tell us about the prices?

### Modify bins
- Perhaps we want to visualize the histogram with finer or coarser bins
- *bins* parameter

## 🔎 What can we say about prices based on the quantiles?
### Use function *axvline* from Matplotlib to draw vertical lines at the quartile positions
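
A hedged sketch of a histogram with quartile markers. It uses randomly generated, right-skewed values as a stand-in for `price` and plain Matplotlib for the drawing; `sns.histplot` accepts a `bins` parameter as well.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up, right-skewed values standing in for the price column
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12, sigma=0.5, size=500), name="price")

# 25 %, 50 % (median) and 75 % quantiles
quartiles = prices.quantile([0.25, 0.5, 0.75])

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(prices, bins=40)              # increase or decrease bins to change the detail
for q in quartiles:
    ax.axvline(q, color="red", linestyle="--")   # one vertical line per quartile
ax.set_xlabel("price")
ax.set_ylabel("count")
```

Half of all values lie between the first and the third line, which makes the skew of the distribution easy to see.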

## 💡 Use the *col* parameter to automatically plot each histogram on its own canvas

# 📊 Boxplots are another very useful technique for visualizing the distributions of numerical features
- 🔎 Did you see any of these before?

- 🔎 How should we read boxplots?
    - **Quartiles**
    - **Median**
    - Box size
    - **Outliers and IQR**
    - Min/max - with or without outliers
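
The quantities listed above can be computed by hand. The sketch below uses made-up values and the common 1.5 x IQR rule for flagging outliers, which is also the default whisker length in Matplotlib and Seaborn boxplots.

```python
import pandas as pd

# Made-up prices (in thousands) with one obvious outlier
prices = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 5000])

q1 = prices.quantile(0.25)       # lower edge of the box
median = prices.quantile(0.50)   # line inside the box
q3 = prices.quantile(0.75)       # upper edge of the box
iqr = q3 - q1                    # box size (interquartile range)

# Points beyond these fences are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(q1, median, q3, iqr)
print(list(outliers))
```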

### Plot only the *price* using boxplot

## Let's take a look at the *price* for different *property_type* values 
- 🔎 What can we say about the prices?

## We can do the same for *property_size_sqft*

## Let's take a look at the *rooms* for different *property_type* values 
- 🔎 What can we say about that?
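
A possible sketch of a grouped boxplot with Seaborn, using made-up data. Swapping `y="price"` for `"property_size_sqft"` or `"rooms"` would give the other comparisons, and passing only `y` draws a single box.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data: 30 rows for each of three property types
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "price": rng.lognormal(mean=12, sigma=0.5, size=90),
    "property_type": np.repeat(["Apartment", "Villa", "Farmhouse"], 30),
})

fig, ax = plt.subplots(figsize=(8, 4))
# One box per category on the x axis, numerical values on the y axis
sns.boxplot(data=df_demo, x="property_type", y="price", ax=ax)
```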

# 📊 Scatter plots are commonly used for visualizing two numerical variables

### We can use standard *scatterplot* with *property_type* as a *hue* so we can better grasp the relationship between *property_size_sqft* and *price*
* Try to set *alpha* parameter for opacity settings
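
A minimal scatterplot sketch with `hue` and `alpha`, using made-up data in which price grows roughly linearly with size. Lower `alpha` values make overlapping points easier to tell apart.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data: price depends on size plus random noise
rng = np.random.default_rng(0)
size = rng.uniform(500, 3000, 150)
df_demo = pd.DataFrame({
    "property_size_sqft": size,
    "price": 150 * size + rng.normal(0, 20_000, 150),
    "property_type": np.repeat(["Apartment", "Villa", "Farmhouse"], 50),
})

fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(
    data=df_demo,
    x="property_size_sqft",
    y="price",
    hue="property_type",   # color points by category
    alpha=0.5,             # 0 = fully transparent, 1 = fully opaque
    ax=ax,
)
```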

# 📊 Correlation
* 🔎 What does the *correlation coefficient* tell you?
* What is the range of it?
* Is it useful for each type of relationship?
* 💡 **Correlation is not causation**
    * Correlation means that two variables are associated with each other, but it does not imply that one variable causes the other to change.
    * e.g. Ice cream sales X Number of thefts

- Take a look at [this link](https://www.dummies.com/education/math/statistics/how-to-interpret-a-correlation-coefficient-r/)
- See also [this](https://www.simplypsychology.org/correlation.html) or [this](https://www.investopedia.com/ask/answers/032515/what-does-it-mean-if-correlation-coefficient-positive-negative-or-zero.asp) for some more info about the topic

![meme03](https://github.com/lowoncuties/VSB-FEI-Machine-Learning-Exercises/blob/main/images/ml_02_meme_03.png?raw=true)

## We can compute the correlation matrix using *corr()* function
* Select columns first

> ['property_size_sqft', 'price', 'constructed_year', 'rooms', 'garage', 'garden', 'crime_cases_reported']

### Visualize correlations using Heatmap
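
A hedged sketch of the correlation matrix and its heatmap on made-up data with three of the columns listed above. `corr()` computes the Pearson coefficient by default, which ranges from -1 to 1 and only measures linear relationships.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data: price depends strongly on size, rooms is unrelated noise
rng = np.random.default_rng(0)
size = rng.uniform(500, 3000, 200)
df_demo = pd.DataFrame({
    "property_size_sqft": size,
    "price": 150 * size + rng.normal(0, 20_000, 200),
    "rooms": rng.integers(1, 10, 200),
})

# Select the columns first, then compute pairwise correlations
corr_matrix = df_demo[["property_size_sqft", "price", "rooms"]].corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr_matrix,
    annot=True, fmt=".2f",   # print the coefficient in each cell
    cmap="coolwarm",
    vmin=-1, vmax=1,         # fix the color scale to the full coefficient range
    ax=ax,
)
print(corr_matrix)
```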

## Data Transformations

It is important to select an appropriate scaling method for the numerical features
* There are many ways to do this - **MinMax, StandardScaler, PowerTransform, ...**
* This step depends heavily on domain knowledge because the scales of the features have a significant effect on the distances between pairs of dataset instances
    - If one variable is in the range **(0, 1)** and the second one is in the range **(5000, 10 000)**, a difference in the **second feature** will be numerically **more important** than a difference in the first one
    - Yet from the **domain point of view** the **first variable may be more important**
    - 💡 Thus it is a good idea to at least transform the features to **similar scales so that their effect on the distance value is similar**
    - The choice of transformation depends heavily on the statistical distribution of the feature
        - 💡 You can use PowerTransform for a heavy-tailed distribution, **Standardization or MinMax normalization for normally distributed features** etc.
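
The three scalers named above could be applied as in the sketch below. The data is made up (a heavy-tailed `price` in arbitrary units and an integer `rooms` column), and `df_scaled_demo` is a hypothetical name chosen for this illustration. The scikit-learn transformers expect 2D input, hence the double brackets.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

# Made-up data: heavy-tailed price, small integer room counts
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "price": rng.lognormal(mean=5, sigma=0.8, size=300),
    "rooms": rng.integers(1, 10, 300),
})

df_scaled_demo = pd.DataFrame(index=df_demo.index)

# MinMax: rescales linearly to the range [0, 1]
df_scaled_demo["rooms_minmax"] = MinMaxScaler().fit_transform(df_demo[["rooms"]]).ravel()

# Standardization: zero mean, unit variance, shape of the distribution unchanged
df_scaled_demo["price_standard"] = StandardScaler().fit_transform(df_demo[["price"]]).ravel()

# Power transform (Yeo-Johnson by default): reduces skew, then standardizes
df_scaled_demo["price_power"] = PowerTransformer().fit_transform(df_demo[["price"]]).ravel()

print(df_demo["price"].skew(), df_scaled_demo["price_power"].skew())
```

Plotting histograms of `price_standard` and `price_power` side by side shows the difference: standardization only shifts and rescales the axis, while the power transform also changes the shape.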

### Create new DataFrame for scaled values

In [None]:
df_scaled = pd.DataFrame(index = df.index)

## Take a look at the *price* feature distribution
* What transformation would be appropriate based on that?

## Plot histogram of the transformed feature
* 🔎 What has changed?

## Take a look at the *property_size_sqft* feature distribution
* What transformation would be appropriate based on that?

## Plot histogram of the transformed feature
* 🔎 What has changed?