# Introduction to Machine Learning - Exercise 1
The aim of this exercise is to get an overview of the basic capabilities of the Pandas, Matplotlib and Seaborn libraries and to be able to set up a Python virtual environment (`venv`).

**Jupyter lab**

* Add code
* Add text
* Execute command
* Shortcuts (a, b, dd, Ctrl+Enter, Shift+Enter, x, c, v)

**Alternatives**

* Google Colab ([Colaboratory](https://colab.research.google.com/))
* Python scripts in VS Code
  
![meme01](https://github.com/lubsar/EFREI-Introduction-to-Machine-Learning/blob/main/images/fml_01_meme_01.png?raw=true)

# Basics of Python programming
* You should already be familiar with Python basics
* Most of the work is done with the NumPy and Pandas libraries, which are the current industry standard
  * Data manipulation tasks are repetitive - it makes no sense to code everything from scratch every single time

## Variables and data types
* Variables are used to store data values. A variable is created the moment you first assign a value to it.
* Variables do not need to be declared with any particular type, and a name can even be rebound to a value of a different type later.

In [None]:
val_x = 5
val_str = 'Hello'
val_null = None

print(val_x, val_str, val_null)
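
The point about changing types can be seen in a minimal sketch. The name `val_dynamic` is made up for this illustration and is not used elsewhere in the exercise.

```python
# Hypothetical variable name, used only for this illustration.
val_dynamic = 5
type_before = type(val_dynamic)   # the name currently refers to an int

val_dynamic = 'five'              # rebinding the same name to a str is allowed
type_after = type(val_dynamic)    # the name now refers to a str

print(type_before.__name__, type_after.__name__)
```

The value itself never changes type; the name is simply pointed at a new object.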

## Operators in Python
* Arithmetic Operators
* Comparison Operators
* Logical Operators
* Assignment Operators 

In [None]:
## Arithmetic Operators
val_x = 5
val_y = 3

print(val_x + val_y)
print(val_x - val_y)
print(val_x * val_y)
print(val_x / val_y)
print(val_x % val_y)
print(val_x ** val_y)

In [None]:
## Comparison Operators
val_x = 5
val_y = 3

print(val_x == val_y)
print(val_x != val_y)
print(val_x > val_y)
print(val_x < val_y)
print(val_x >= val_y)
print(val_x <= val_y)

In [None]:
## Logical Operators
val_x = 5
val_y = 3

print(val_x == 5 and val_y == 3)
print(val_x == 5 or val_y == 2)
print(not(val_x == 5 and val_y == 3))

In [None]:
## Assignment Operators
val_x = 5
val_y = 3

val_x += 1
print(val_x)

val_x -= 1
print(val_x)

# Python data structures
* list
* tuple
* dictionary

In [None]:
## List
val_list = [1, 2, 3, 4, 5]

print(val_list)
print(val_list[0])
print(val_list[1:3])
print(val_list[:3])
print(val_list[2:])

val_list[0] = 10
print(val_list)

val_list.append(6)
print(val_list)

val_list.insert(1, 20)
print(val_list)

val_list.remove(20)
print(val_list)


In [None]:
## Tuple
val_tuple = (1, 2, 3, 4, 5)
print(val_tuple)
print(val_tuple[0])
print(val_tuple[2:])

In [None]:
## Dictionary
val_dict = {'name': 'John', 'age': 25, 'address': 'New York'}
print(val_dict)
print(val_dict['name'])
print(val_dict['age'])
print(val_dict['address'])

val_dict['name'] = 'Jane'
print(val_dict)

if 'name' in val_dict:
    print('Name is present in the dictionary')

if 'Salary' not in val_dict:
    print('Salary is not present in the dictionary')

# Python `if` statement and `for` loop

In [None]:
val_x = 5
if val_x > 3:
    print('x is greater than 3')
elif val_x < 3:
    print('x is less than 3')
else:
    print('x is equal to 3')

In [None]:
val_list = [1, 2, 3, 4, 5]
for val in val_list:
    print(val)

In [None]:
for val in range(5):
    print(val)

In [None]:
for val in range(1, 10, 2):
    print(val)

# Functions

In [None]:
def add(x, y):
    return x + y

print(add(5, 3))

In [None]:
def add(x, y=3):
    return x + y

print(add(5))
print(add(5, 4))

In [None]:
def add_sub(x, y):
    return x + y, x - y

# Use distinct names so the add() function defined earlier is not overwritten
res_add, res_sub = add_sub(5, 3)
print(res_add)
print(res_sub)

# Data processing

Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Important attributes description:
* SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* Heating: Type of heating
* CentralAir: Central air conditioning
* GrLivArea: Above grade (ground) living area square feet
* BedroomAbvGr: Number of bedrooms above basement level

## Pandas tasks:
* Load data
* Standard data inspection (functions head(), tail(), describe(), isna(), shape)
* Select one attribute to variable 
    - Series and numpy compatibility
* dtype, index, columns
* Data selection - [], loc, iloc
* Data filtering and logical operators
* Add new column to dataframe
* Calculate new numerical attribute
* Data selection - comparison and negation
* Assign new values to selected rows from dataframe
* Use .apply() for rows and single column
* Use .groupby() for data aggregation

## Import used packages

In [None]:
import pandas as pd # dataframes
import numpy as np # matrices and linear algebra
import matplotlib.pyplot as plt # plotting
import seaborn as sns # another matplotlib interface - styled and easier to use

## The first step is to load the data into the Pandas DataFrame - in our case it is a csv file
* https://github.com/lubsar/EFREI-Introduction-to-Machine-Learning/blob/main/datasets/zsu_cv1_data.csv

In [None]:
df_full = pd.read_csv('https://raw.githubusercontent.com/lubsar/EFREI-Introduction-to-Machine-Learning/main/datasets/zsu_cv1_data.csv', sep=',')

## We should take a look at the data after loading to check that everything is OK

### We will start with showing first/last N rows 
- There are several ways of doing that:
    - name of the dataframe
    - head()
    - tail()

### Show the first and last 5 rows

In [None]:
df_full

### Show first 5 rows

In [None]:
df_full.head()

### Show last 20 rows

In [None]:
df_full.tail(20)

## If we want to know whether there are any missing values, the isna() function may prove useful

In [None]:
df_full.isna().sum().sort_values(ascending=False).head(20)
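
To see what the chained call above does step by step, here is a small sketch on a made-up toy frame (the column names `a`, `b`, `c` and their values are assumptions for illustration, not part of the house-prices data).

```python
import numpy as np
import pandas as pd

# Made-up toy data, not taken from the house-prices dataset.
df_toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': ['x', None, None],
    'c': [10, 20, 30],
})

# isna() returns a boolean frame of the same shape;
# sum() then counts the True values in each column.
missing_counts = df_toy.isna().sum().sort_values(ascending=False)
print(missing_counts)
```

Sorting in descending order puts the columns with the most missing values first, which is why `head(20)` on the full dataset shows the most problematic columns.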

## We can show a summary of common statistical characteristics of the data using the describe() function

In [None]:
df_full.describe()

## 💡 DataFrame has several useful properties
- shape
- dtypes
- columns
- index

#### Row and column count

In [None]:
df_full.shape

#### Datatypes of columns

In [None]:
df_full.dtypes

#### Column names

In [None]:
df_full.columns

#### Row index values

In [None]:
df_full.index

## We may want to work with just one column not the whole dataframe
- We will select only the SalePrice column and save it to another variable

In [None]:
price = df_full['SalePrice'] # df_full.SalePrice
price

## A single column is a Pandas Series - it shares a common API with the Pandas DataFrame
- 💡 Pandas is NumPy-backed, so the data of a Series can be obtained as a standard NumPy array through the .values property

In [None]:
arr_np = price.values
arr_np
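
As a hedged sketch of the Series and NumPy compatibility, the example below uses a made-up Series (`s_toy` and its values are assumptions for illustration). It also shows `to_numpy()`, the method form that the pandas documentation recommends over `.values`.

```python
import numpy as np
import pandas as pd

s_toy = pd.Series([100, 250, 175], name='price')  # made-up values

arr_values = s_toy.values      # property used in the cell above
arr_numpy = s_toy.to_numpy()   # method form with the same result here

print(type(arr_numpy), arr_numpy.dtype)
print(np.array_equal(arr_values, arr_numpy))

# NumPy functions also accept the Series directly.
print(np.max(s_toy), s_toy.max())
```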

## Find maximum price using Numpy and Pandas

In [None]:
price.max()

In [None]:
np.max(arr_np)

## Data filtering using Pandas DataFrame
- There are several ways of filtering the data (similar logic to .Where() in C# or WHERE in SQL)
- 💡 We usually work with two indexers - .loc[] and .iloc[]

### The .iloc[] indexer works with positional indexes - very close to the way of working with the raw arrays
### The .loc[] indexer works with column names and logical expressions
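
The difference between the two indexers can be sketched on a small made-up frame (`df_toy` and its columns are assumptions for illustration). Note that slices in `.iloc[]` exclude the end position, while slices in `.loc[]` include the end label.

```python
import pandas as pd

# Made-up frame with the default integer index 0..3.
df_toy = pd.DataFrame({
    'price': [100, 200, 300, 400],
    'rooms': [2, 3, 4, 5],
})

rows_iloc = df_toy.iloc[0:2, :]   # positions 0 and 1, end is exclusive
rows_loc = df_toy.loc[0:2, :]     # labels 0, 1 and 2, end is inclusive

# .loc[] also accepts a boolean mask for rows and a column name.
by_mask = df_toy.loc[df_toy['rooms'] > 3, 'price']

print(len(rows_iloc), len(rows_loc))
print(by_mask.tolist())
```

This is why selecting rows 15 to 22 by position needs the slice `15:23`.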

### Select all rows and 3rd column of dataframe

In [None]:
df_full.iloc[:, 2]

### Select all rows and LAST column of dataframe

In [None]:
df_full.iloc[:, -1]

### Select rows 15 to 22 and all columns

In [None]:
df_full.iloc[15:23, :]

### Select rows 15 to 22 and 3rd column

In [None]:
df_full.iloc[15:23, 2]  # positional index 2 is the 3rd column

## Select only a subset of columns to a new dataframe
* 'Id', 'SalePrice','MSSubClass','BldgType','HouseStyle','OverallQual','OverallCond','YearBuilt','Heating','CentralAir','GrLivArea','BedroomAbvGr'

In [None]:
df = df_full.loc[:, ['Id', 'SalePrice','MSSubClass','BldgType','HouseStyle','OverallQual','OverallCond','YearBuilt','Heating','CentralAir','GrLivArea','BedroomAbvGr']].copy()
df.head()

### Select only houses built in year 2000 or later
* YearBuilt

In [None]:
df.loc[df.YearBuilt >= 2000, :]

### Select only houses that don't use GasA for heating (try != and ~ operators)
* Heating

In [None]:
df.loc[df.Heating != 'GasA', :]

In [None]:
df.loc[~(df.Heating == 'GasA'), :]

### Select houses cheaper than 180k USD and with at least 2 bedrooms
* SalePrice, BedroomAbvGr

In [None]:
df.loc[(df.SalePrice < 180000) & (df.BedroomAbvGr >= 2), :]

### Select houses with 2 stories or air conditioning
* HouseStyle, CentralAir

In [None]:
df.loc[(df.HouseStyle == '2Story') | (df.CentralAir == 'Y'), :]
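
The filters above combine boolean Series element by element. The sketch below, on a made-up frame (`df_toy` and its values are assumptions), stores the masks in variables first. The parentheses in the one-line versions are needed because `&` and `|` bind more tightly than comparison operators, and the keywords `and` / `or` raise a `ValueError` when applied to a Series.

```python
import pandas as pd

# Made-up data for illustration only.
df_toy = pd.DataFrame({
    'price': [150000, 200000, 170000],
    'bedrooms': [3, 2, 1],
})

mask_price = df_toy.price < 180000   # one True/False value per row
mask_beds = df_toy.bedrooms >= 2

both = df_toy.loc[mask_price & mask_beds, :]        # element-wise AND
either = df_toy.loc[mask_price | mask_beds, :]      # element-wise OR
neither = df_toy.loc[~(mask_price | mask_beds), :]  # element-wise NOT

print(len(both), len(either), len(neither))
```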

In [None]:
df.SalePrice.describe()

# We can add new columns to the DataFrame as well

![meme01](https://github.com/lubsar/EFREI-Introduction-to-Machine-Learning/blob/main/images/fml_01_meme_02.png?raw=true)

### Add a new column named Age for each house (current year - year built)
* YearBuilt

In [None]:
# Use the actual current year instead of a hard-coded value
df.loc[:, 'Age'] = pd.Timestamp.now().year - df.YearBuilt

### Add a new column IsLuxury with True value for houses with more than 3 bedrooms and price above 214k USD (.loc)
- How many luxury houses are in the dataset?
- SalePrice, BedroomAbvGr

In [None]:
df['IsLuxury'] = False
df.loc[(df.SalePrice > 214000) & (df.BedroomAbvGr > 3), 'IsLuxury'] = True
df.IsLuxury.sum()  # True counts as 1, so the sum is the number of luxury houses

## Pandas enables us to use aggregation functions for the data using the .groupby() function

### Compute counts for all the heating methods (groupby / value_counts)
* Heating

In [None]:
df.groupby('Heating').Heating.count()

In [None]:
df.Heating.value_counts()
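
The task list also mentions `.apply()` for a single column and for rows. A minimal sketch on a made-up frame follows; `df_toy`, the derived columns `PriceK` and `PricePerSqFt`, and the helper `price_per_sqft` are hypothetical names chosen for this illustration.

```python
import pandas as pd

# Made-up values; only the column names mirror the dataset.
df_toy = pd.DataFrame({
    'SalePrice': [150000, 320000, 210000],
    'GrLivArea': [1000, 2000, 1500],
})

# Single column: the function receives one value at a time.
df_toy['PriceK'] = df_toy['SalePrice'].apply(lambda price: price / 1000)

# Rows: with axis=1 the function receives each row as a Series.
def price_per_sqft(row):
    return row['SalePrice'] / row['GrLivArea']

df_toy['PricePerSqFt'] = df_toy.apply(price_per_sqft, axis=1)

print(df_toy)
```

For simple arithmetic like this, the vectorised form `df_toy.SalePrice / df_toy.GrLivArea` gives the same result and is usually faster; `.apply()` is most useful when the per-row logic cannot be expressed with column operations.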

# Visualization

## Scatter plot
- Visualize relationship between SalePrice and GrLivArea.
- Use scatter plot from **Matplotlib**.

In [None]:
plt.scatter(df.GrLivArea, df.SalePrice)

### Modify figure size and add title

In [None]:
fig = plt.figure(figsize=(9,6))
plt.scatter(df.GrLivArea, df.SalePrice)
plt.title('House size and price relationship')
plt.show()

### Add axis labels

In [None]:
fig = plt.figure(figsize=(9,6))
plt.scatter(df.GrLivArea, df.SalePrice)
plt.title('House size and price relationship')
plt.xlabel('Living area')
plt.ylabel('Price')
plt.show()

### Add colors for data points based on CentralAir value.

In [None]:
fig = plt.figure(figsize=(9,6))
plt.scatter(df.GrLivArea, df.SalePrice, color=df.CentralAir.map({'Y':'blue', 'N':'red'}))
plt.title('House size and price relationship')
plt.xlabel('Living area')
plt.ylabel('Price')
plt.show()

## Try to use scatterplot from **Seaborn** library for scatter plot visualization.

#### Use dataframe as source and column names for axes data

In [None]:
sns.scatterplot(data=df, x='GrLivArea', y='SalePrice')

### Resize plot and add color for markers based on CentralAir column

In [None]:
fig = plt.figure(figsize=(9,6))
sns.scatterplot(data=df, x='GrLivArea', y='SalePrice', hue='CentralAir')

## Line plot
- Calculate and visualize the average house price in relation to YearBuilt.

In [None]:
avg_prices = df.groupby('YearBuilt').SalePrice.mean()
avg_prices

In [None]:
plt.figure(figsize=(9,6))
plt.plot(avg_prices.index, avg_prices.values, marker='x')

## Bar plot
- Calculate and visualize how many houses have CentralAir
- Use a bar plot for visualization

In [None]:
number_of_categories = df.groupby('CentralAir').CentralAir.count()
number_of_categories

In [None]:
plt.bar(x=number_of_categories.index, height=number_of_categories.values)

### Visualize the number of houses per house style and whether they have air conditioning using Seaborn

In [None]:
df_number_of_categories = df.groupby(['HouseStyle', 'CentralAir']).Id.count().reset_index(name='Count')
df_number_of_categories

In [None]:
sns.barplot(data=df_number_of_categories, y='HouseStyle', x='Count', hue='CentralAir')

# Tasks
## ✅ Python
* Create a function that takes a `list` of numbers as input, computes the *minimum, maximum, mean, median* and returns the values in the form of a tuple
  * Do not use any library functions for computing the values (use pure Python, thus avoid the standard library, numpy, pandas, etc.)
* Test the function using `SalePrice` column
  * 💡 You can compare your values with the correct ones obtained using Numpy/Pandas functions

## ✅ Pandas
* Add a new column *Undervalued* which is set to True if the house is priced below 163k USD and has both OverallQual and OverallCond higher than 5.

* **How many undervalued houses are in the dataset?**

## ✅ Visualization
* Add a new attribute to the dataframe that indicates whether the house was built before or after the year 2000.

* **Create a bar chart of the number of houses by type of dwelling (attribute BldgType, use as the category axis) and by the added binary house age attribute (use as the bar color).**

In [None]:
def basic_statistics(arr: list) -> tuple[float, float, float, float]:
    pass

In [None]:
basic_statistics(df.SalePrice.to_list())