# DSC320: Data Scaling and Working with DataFrames

**Name**: Joseph Choi <br>
**Class**: DSC320-T301 Math for Data Science (2243-1)

In [34]:
# Setup
import math
import pandas as pd

#### JC Notes:
**Scaling**: 
- Adjusting the range of values for your input features so that they all fall within the same scale
- Prevents any single feature from disproportionately influencing the model's performance

## 1. Data Normalization

In [35]:
"""
Code Description: Defining a function that takes in vector and normalizes it

Formula Breakdown:
- Xi: Specific value or values in the vector that I want to normalize
- min{x1,x2,...,xn}: Min value in the vector
- max{x1,x2,...,xn}: Max value in the vector

Code Breakdown:
- 1st Part: Calculating the min and max values in vector
- 2nd Part: Normalizing vectors based on provided formula
    - 'for x in vector': looping through each value in vector
- 3rd Part: Returning output (normalized vector)
"""

def data_normalization(vector):
    
    # 1st Part:
    norm_min_val = min(vector)
    norm_max_val = max(vector)
    
    # 2nd Part:
    normalized = [(x - norm_min_val) / (norm_max_val - norm_min_val) for x in vector]   
    
    # 3rd Part:
    return normalized
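
A minimal usage sketch of the min-max formula above, using the first three 'housing_median_age' values shown later in this notebook. The helper name `data_normalization_safe` and its handling of constant vectors are assumptions added for illustration. The original function divides by zero when every value in the vector is the same.

```python
def data_normalization_safe(vector):
    """Min-max normalization with a guard for constant vectors (hypothetical helper)."""
    min_val = min(vector)
    max_val = max(vector)
    value_range = max_val - min_val

    if value_range == 0:
        # Assumption: when all values are equal, map every entry to 0.0
        return [0.0 for _ in vector]

    # Same formula as data_normalization: (x - min) / (max - min)
    return [(x - min_val) / value_range for x in vector]


ages = [41, 21, 52]
normalized_ages = data_normalization_safe(ages)
print(normalized_ages)
```

The smallest value maps to 0 and the largest to 1, with every other value falling in between.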

## 2. Data Standardization

In [36]:
"""
Code Description: Defining a function that takes in vector and standardizes it

Formula Breakdown:
- Xi: Specific value or values in the vector that I want to standardize
- x_: Mean
- sx: Standard deviation = SQRT(SUM((xi - x_)^2) / n), i.e. the population standard deviation

Code Breakdown:
- 1st Part: Calculating the mean and standard deviation
- 2nd Part: Standardizing the vectors based on provided formula
    - 'for x in vector': looping through each value in vector
- 3rd Part: Returning output (standardized vector)
"""

def data_standardization(vector):

    # 1st Part:
    stand_mean_val = sum(vector) / len(vector)
    stand_std_dev = math.sqrt(sum((x - stand_mean_val) ** 2 for x in vector) / len(vector))

    
    # 2nd Part:
    standardized = [(x - stand_mean_val) / stand_std_dev for x in vector]
    
    # 3rd Part:
    return standardized
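
A small worked example of the standardization formula, assuming a made-up sample vector chosen so the arithmetic is easy to check by hand (mean 5, population standard deviation 2). It also compares the population standard deviation (divide by n, as in the function above) with the sample standard deviation (divide by n - 1) from Python's `statistics` module.

```python
import math
import statistics


def data_standardization(vector):
    # Mean and population standard deviation (divides by n)
    mean_val = sum(vector) / len(vector)
    std_dev = math.sqrt(sum((x - mean_val) ** 2 for x in vector) / len(vector))
    return [(x - mean_val) / std_dev for x in vector]


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, not from the data set
z_scores = data_standardization(sample)

pop_std = statistics.pstdev(sample)  # divides by n
samp_std = statistics.stdev(sample)  # divides by n - 1

print(z_scores)
print(pop_std, samp_std)
```

The two standard deviations differ slightly, so standardized values computed with a library default that divides by n - 1 will not match this function exactly.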

## 3. Working with a DataFrame

In [37]:
# Loading csv file 'calif_housing_data.csv'
calif_housing_df = pd.read_csv('calif_housing_data.csv')

# Creating copy to work off of
calif_housing_copy = calif_housing_df.copy()
calif_housing_copy.head(3)

Unnamed: 0,housing_median_age,total_bedrooms,households,median_income,median_house_value
0,41,129.0,126,8.3252,452600.0
1,21,1106.0,1138,8.3014,358500.0
2,52,190.0,177,7.2574,352100.0


#### (a) How many rows does this data set have?

In [38]:
# Finding the number of rows via shape

rows = calif_housing_copy.shape[0]
rows

20640

#### (b) What is the target vector for your model?

**Response**: <br> Per the instructions, the model predicts the median house value, so the target vector is the 'median_house_value' column.

In [40]:
# Setting the target vector
target_vector = calif_housing_copy['median_house_value']

#### (c) Create a new feature by taking the total bedrooms divided by the number of households. What does this new feature represent?

In [41]:
# Creating a new feature per instructions
calif_housing_copy['avg_bed_per_household'] = calif_housing_copy['total_bedrooms'] / calif_housing_copy['households']
calif_housing_copy.head(3)

Unnamed: 0,housing_median_age,total_bedrooms,households,median_income,median_house_value,avg_bed_per_household
0,41,129.0,126,8.3252,452600.0,1.02381
1,21,1106.0,1138,8.3014,358500.0,0.97188
2,52,190.0,177,7.2574,352100.0,1.073446


**Response**: <br> 
The new feature represents the average number of bedrooms per household for each row of the data set.

#### (d) Create a new data frame that has three features: the median age, median income, and the new feature created in part c

In [50]:
# Creating a new df with specified features
three_features_df = calif_housing_copy[['housing_median_age', 'median_income', 'avg_bed_per_household']]
three_features_df.head(3)

Unnamed: 0,housing_median_age,median_income,avg_bed_per_household
0,41,8.3252,1.02381
1,21,8.3014,0.97188
2,52,7.2574,1.073446


#### (e) Take the data frame created in part d and apply data standardization to the features

In [53]:
# Creating copies of three_features_df to apply data standardization

three_features_copy1 = three_features_df.copy() # 1st Attempt
three_features_copy2 = three_features_df.copy() # 2nd Attempt

In [54]:
"""
Code Description:
- 1st attempt at standardizing features

Code Breakdown:
- 1st Part: Looping over columns in three_features_copy1 and applying standardization procedure via 'data_standardization'
- 2nd Part: Printing standardized output
"""

# 1st Part:
for column in three_features_copy1.columns:
    three_features_copy1[column] = data_standardization(three_features_copy1[column])

# 2nd Part: 
three_features_copy1.head(3)

Unnamed: 0,housing_median_age,median_income,avg_bed_per_household
0,0.982143,2.344766,NaN
1,-0.607019,2.332238,NaN
2,1.856182,1.782699,NaN


In [56]:
"""
1st Attempt Notes:
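- Noticed that every value in the 'avg_bed_per_household' column is 'NaN'
- Checking whether the original df has any null values
- Output: 207 null values in 'avg_bed_per_household'
- A single null makes the sum (and therefore the mean and standard deviation) in 'data_standardization' NaN, so the whole column becomes NaN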
"""

three_features_df.isnull().sum()

housing_median_age         0
median_income              0
avg_bed_per_household    207
dtype: int64
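
To illustrate why a few missing values wipe out the entire standardized column, here is a minimal sketch with a made-up four-element vector containing one NaN. The function body repeats the logic of `data_standardization`. The sample values are assumptions for illustration only.

```python
import math


def data_standardization(vector):
    mean_val = sum(vector) / len(vector)
    std_dev = math.sqrt(sum((x - mean_val) ** 2 for x in vector) / len(vector))
    return [(x - mean_val) / std_dev for x in vector]


values_with_gap = [1.0, 2.0, math.nan, 4.0]  # one missing value

# sum() over a list containing NaN is NaN, so the mean and std are NaN too
result = data_standardization(values_with_gap)
all_nan = all(math.isnan(v) for v in result)
print(all_nan)
```

Because any arithmetic with NaN yields NaN, every entry of the result is NaN, not only the one that was missing.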

In [57]:
"""
Code Description:
- 2nd attempt at standardizing features
- Found missing values in 'avg_bed_per_household' column
- Replace null values with the mean of the column
- Then, re-apply data standardization

Code Breakdown:
- 1st Part: Calculating the mean of 'avg_bed_per_household'
- 2nd Part: Filling missing values with the calculated mean
- 3rd Part: Looping over columns in three_features_copy2 and applying standardization procedure via 'data_standardization'
- 4th Part: Printing standardized output
"""

# 1st Part:
avg_bed_mean = three_features_copy2['avg_bed_per_household'].mean()

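# 2nd Part:
# Assigning the result back instead of using inplace=True on a selected column, which newer pandas versions warn about
three_features_copy2['avg_bed_per_household'] = three_features_copy2['avg_bed_per_household'].fillna(avg_bed_mean)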

# 3rd Part:
for column in three_features_copy2.columns:
    three_features_copy2[column] = data_standardization(three_features_copy2[column])

# 4th Part: 
three_features_copy2.head(3)

Unnamed: 0,housing_median_age,median_income,avg_bed_per_household
0,0.982143,2.344766,-0.15464
1,-0.607019,2.332238,-0.264265
2,1.856182,1.782699,-0.049855
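
As an alternative sketch, the fill-then-standardize steps can also be written with vectorized pandas operations instead of a column loop. The small DataFrame and its column names below are assumptions for illustration, not the housing data. Note that `ddof=0` is passed explicitly because pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while `data_standardization` divides by n.

```python
import pandas as pd

# Illustrative frame with one missing value (not the housing data)
demo_df = pd.DataFrame({
    "feature_a": [1.0, 2.0, None, 4.0],
    "feature_b": [10.0, 20.0, 30.0, 40.0],
})

# Fill each column's missing values with that column's mean (NaN is skipped by mean())
filled_df = demo_df.fillna(demo_df.mean())

# Standardize every column at once using the population standard deviation
standardized_df = (filled_df - filled_df.mean()) / filled_df.std(ddof=0)
print(standardized_df)
```

After this step each column should have a mean of approximately 0 and a population standard deviation of approximately 1.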
