## Group: The Order of the PyTorch
### Milestone 1

- Members: Onur Buyukkalkan, Yi-Huai Chang, Diyanet Nijiati

- Project: Costa Rican Household Poverty Level Prediction

https://www.kaggle.com/competitions/costa-rican-household-poverty-prediction/overview 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

## Step 1: Initial Exploration

In [None]:
#Load and Explore the Data
data = pd.read_csv('train.csv')
print(data.head())
print(data.describe())

In [None]:
#Checking Shape and NA Values Across Columns
print("Data shape:", data.shape)
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None) 
print(data.isnull().sum())

# v2a1, v18q1, and rez_esc have many null values; we need to decide how to handle them before modeling.
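
To judge whether the null counts are a problem, it helps to look at the share of missing values per column rather than the raw counts. The following is a minimal sketch on a made-up toy frame; the `toy` data and the 0.7 `threshold` are assumptions for illustration, not values taken from the real training set.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real training data (values are made up).
toy = pd.DataFrame({
    "v2a1": [np.nan, 100000.0, np.nan, 80000.0],
    "rez_esc": [np.nan, np.nan, np.nan, 1.0],
    "hhsize": [3, 4, 2, 5],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Columns above a chosen (hypothetical) threshold are candidates for dropping
# or for column-specific handling instead of a blanket fill.
threshold = 0.7
mostly_missing = missing_share[missing_share > threshold].index.tolist()
print(mostly_missing)
```

On the real data, the same two lines can be applied to `data` to decide column by column between dropping and imputing.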

In [None]:
#Counts of the Target Labels
print(data['Target'].value_counts())

## Step 2: Plotting

Next, we visualize the data: the distribution of the target labels and the correlations between the numeric features.

In [None]:
# Plotting the distribution of the poverty levels
sns.countplot(x='Target', data=data)
plt.title('Distribution of Poverty Levels')
plt.show()

In [None]:
# Select only numeric columns for correlation matrix
numeric_data = data.select_dtypes(include=[np.number])  # np.number covers integers and floats

# Now compute the correlation matrix
corr_matrix = numeric_data.corr()

# Use a white background so the axis labels stay legible
sns.set(style="white")

plt.figure(figsize=(12, 10))  # Adjust figure size to your preference
ax = sns.heatmap(
    corr_matrix,
    annot=False,
    cmap='coolwarm',
    cbar=True,
    xticklabels=True,
    yticklabels=True
)

# Rotate the labels on the x-axis for better visibility
plt.xticks(rotation=90, fontsize=8)  # Rotate x labels and set font size
plt.yticks(rotation=0, fontsize=8)  # Rotate y labels and set font size (if needed)

plt.title('Correlation Heatmap')
plt.show()

# The heatmap shows that hhsize, tamhog, r4t3, and hogar_total are almost perfectly correlated, so they appear to encode the same quantity.
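
Reading duplicates off a heatmap with this many columns is error-prone, so the same check can be done programmatically. This is a rough sketch on toy data; the `cutoff` of 0.95 and the toy values are assumptions, and which column of a pair to keep remains a judgment call.

```python
import numpy as np
import pandas as pd

# Toy data: 'tamhog' duplicates 'hhsize'; 'rooms' is unrelated (values are made up).
toy = pd.DataFrame({
    "hhsize": [2, 3, 4, 5, 6],
    "tamhog": [2, 3, 4, 5, 6],
    "rooms": [3, 1, 4, 1, 5],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Hypothetical cutoff: list the pairs whose absolute correlation exceeds it.
cutoff = 0.95
pairs = upper.stack()
high_pairs = pairs[pairs > cutoff]
print(high_pairs)

# For each highly correlated pair, mark the later column as a candidate to drop.
to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
print(to_drop)
```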

## Step 3: Data Cleaning

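We can fill missing values with the column median, or drop columns that are mostly empty. The cell below imputes numeric columns with the median and drops columns with more than 70% missing values.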

In [None]:
# Check for missing values
missing_data = data.isnull().sum()
missing_data = missing_data[missing_data > 0]
print(missing_data.sort_values(ascending=False))

# Drop columns with more than 70% missing values first; imputing them
# beforehand would only fill most of the column with a single value
mostly_missing = missing_data[missing_data > 0.7 * len(data)].index
data = data.drop(columns=mostly_missing)

# Impute the remaining numeric columns with the median
for column in missing_data.index.difference(mostly_missing):
    if data[column].dtype != 'object':  # only numeric columns are imputed
        data[column] = data[column].fillna(data[column].median())


In [None]:
# TODO: convert text features to integers
# TODO: convert edjefe and edjefa to dummy variables
# TODO: compute our own dependency rate

# Open question: should we group by household before or after prediction?
# One option is to predict per individual, average the predictions within each
# household, and round that mean to obtain the household's target label.
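
As one possible way to handle the text features, `edjefe` and `edjefa` can be turned into numeric years of education instead of dummy variables. This sketch assumes the encoding given in the Kaggle data description (yes = 1, no = 0, otherwise a number stored as text); the toy rows and the `yes_no` mapping name are made up for illustration.

```python
import pandas as pd

# Toy rows mimicking the mixed string/number encoding (values are made up).
toy = pd.DataFrame({
    "edjefe": ["no", "11", "yes", "6"],
    "edjefa": ["8", "no", "no", "yes"],
})

# Assumption taken from the Kaggle data description: yes = 1 and no = 0.
yes_no = {"yes": 1, "no": 0}

for col in ["edjefe", "edjefa"]:
    # Replace the two text labels, leave every other entry untouched,
    # then parse the whole column as numbers.
    mapped = toy[col].map(lambda value: yes_no.get(value, value))
    toy[col] = pd.to_numeric(mapped)

print(toy.dtypes)
print(toy)
```

Keeping the columns numeric preserves the ordering of education years, which one-hot dummies would discard; whether that helps the model is something to verify on the validation set.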

In [None]:
# Select only numeric columns
numeric_cols = data.select_dtypes(include=['int64', 'float64']).columns

# Exclude 'Target' so the label is not scaled together with the features
numeric_cols = numeric_cols.drop('Target', errors='ignore')  # errors='ignore' avoids a KeyError if the column is absent

In [None]:
#Standardization

from sklearn.preprocessing import StandardScaler

# Create a scaler object
scaler = StandardScaler()

# Fit the scaler and transform the data
data[numeric_cols] = scaler.fit_transform(data[numeric_cols])

# Check the transformed data
print(data[numeric_cols].head())


In [None]:
print(data[numeric_cols].mean())
print(data[numeric_cols].std())
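
One caveat with scaling the full dataset is that the mean and standard deviation then include rows that later end up in the validation and test sets. A common alternative, sketched here on random toy data, is to split first and fit the scaler on the training rows only. The column names in `feature_cols` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy data standing in for the real features (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "rooms": rng.integers(1, 10, size=100).astype(float),
    "escolari": rng.integers(0, 20, size=100).astype(float),
    "Target": rng.integers(1, 5, size=100),
})
feature_cols = ["rooms", "escolari"]

# Split before scaling so the held-out rows do not influence the statistics.
train_df, test_df = train_test_split(toy, test_size=0.3, random_state=42)
train_df = train_df.copy()
test_df = test_df.copy()

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only...
train_df[feature_cols] = scaler.fit_transform(train_df[feature_cols])
# ...and reuse those same statistics for the held-out rows.
test_df[feature_cols] = scaler.transform(test_df[feature_cols])

print(train_df[feature_cols].mean())
print(test_df[feature_cols].mean())
```

The training means come out at zero, while the test means are only close to zero, which is expected when the scaler has not seen those rows.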

In [None]:
# Final datatype and na check
print(data.dtypes)

print(data.isnull().sum().max())


In [None]:
#Group by idhogar
household_avg = data.groupby('idhogar')[numeric_cols].mean()

# Display the result
print(household_avg.head())
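
Taking the mean of every column is only one way to summarize a household. The sketch below, on toy data, uses different aggregations per column and takes the label from the head of household. It assumes that `parentesco1 == 1` marks the household head, as we understand the Kaggle data description; the toy values and output column names are made up.

```python
import pandas as pd

# Toy individuals in two households (values are made up).
toy = pd.DataFrame({
    "idhogar": ["a", "a", "a", "b", "b"],
    "parentesco1": [1, 0, 0, 0, 1],
    "escolari": [11, 6, 0, 9, 12],
    "age": [40, 38, 5, 30, 33],
    "Target": [2, 2, 2, 4, 4],
})

# Named aggregation: one summary per household, chosen per column.
household = toy.groupby("idhogar").agg(
    escolari_mean=("escolari", "mean"),
    escolari_max=("escolari", "max"),
    age_min=("age", "min"),
    members=("age", "size"),
)

# Assumption: parentesco1 == 1 flags the head of household.
# Use the head's label as the household label (aligned on idhogar).
head_target = toy.loc[toy["parentesco1"] == 1].set_index("idhogar")["Target"]
household["Target"] = head_target

print(household)
```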

## Step 4: Feature Selection

The features that are available to us are described here: https://www.kaggle.com/competitions/costa-rican-household-poverty-prediction/data

The available features range from material amenities, such as the toilet type and the source of electricity, to household characteristics, such as disability, number of children, and years of education. These are indirect indicators that may reflect the households' quality of life.

Candidate features include the outside wall material, roof material, number of tablets owned, toilet type, and electricity source. Ratios such as children per adult can be derived directly, but income per child and income per household member cannot, because the dataset contains no income data. We therefore need proxy features that still reflect the level of income.

#### Potential New Features 

- **Ratio of Children to Adults**: This can highlight households that may be under more financial strain.

- **Dependency Ratio**: This is already provided, but recalculating it lets us check the provided values for discrepancies.

- **Asset Index**: Create a composite score based on the presence of assets (e.g., refrigerator, computer, tablet, TV) and home characteristics (types of walls, floors, and roof materials). This score can serve as a proxy for economic status.

- **Educational Level Index**: A score representing the overall educational attainment within the household.
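
The children-to-adults ratio from the list above could be computed roughly as follows. The sketch assumes that `hogar_nin` and `hogar_adul` hold the number of children and adults per household, as we read the Kaggle data description; the toy values and the `child_adult_ratio` name are made up.

```python
import numpy as np
import pandas as pd

# Toy households (values are made up).
toy = pd.DataFrame({
    "hogar_nin": [2, 0, 3, 1],
    "hogar_adul": [2, 1, 0, 4],
})

# Households without adults would divide by zero, so mark the
# denominator as missing and let the ratio be NaN for those rows.
adults = toy["hogar_adul"].replace(0, np.nan)
toy["child_adult_ratio"] = toy["hogar_nin"] / adults

print(toy)
```

How to treat the NaN rows (impute, flag with an indicator column, or drop) is left open here and should follow the strategy chosen in the data cleaning step.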

#### Limitations

- Households may underreport or overreport their situation in order to qualify for financial assistance.
- No monetary income or asset values are reported.
- A dimensionality problem may arise if we fail to identify the most important features and eliminate the less important ones.

In [None]:
#Trying Out Indexes

# The columns in `data` were standardized above, so their sums no longer count
# anything. Build the indexes from the raw 0/1 indicators instead; the rows of
# `raw` line up with `data` because no rows were dropped or reordered.
raw = pd.read_csv('train.csv')

# Asset Index: number of the listed assets that are present
asset_cols = ['refrig', 'v18q', 'computer', 'television', 'mobilephone']
data['asset_index'] = raw[asset_cols].sum(axis=1).astype(int)

# Educational Level Index: instlevel1..instlevel9 weighted by level (1 to 9)
data['education_index'] = sum(raw[f'instlevel{i}'] * i for i in range(1, 10))


In [None]:
# Split the data into 70% training and 30% temporary set
train_data, temp_data = train_test_split(data, test_size=0.3, random_state=42)

# Split the temporary set into 10% validation and 20% test set
# Since the temp_data is 30% of the data, we take 1/3 of it for validation (which is 10% of the total data)
validation_data, test_data = train_test_split(temp_data, test_size=2/3, random_state=42)


In [None]:
print("Training set size: ", train_data.shape)
print("Validation set size: ", validation_data.shape)
print("Test set size: ", test_data.shape)
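
Because the rows are individuals, a purely random split can place members of the same household in both the training and the test set. A group-aware split avoids that. The following is a sketch with scikit-learn's `GroupShuffleSplit` on random toy data; the toy frame is made up, and note that `test_size` here is a share of households, so the row proportions are only approximately 70/10/20.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Random toy individuals assigned to up to 50 households (values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "idhogar": rng.integers(0, 50, size=n).astype(str),
    "feature": rng.normal(size=n),
    "Target": rng.integers(1, 5, size=n),
})

# First split: roughly 70% of households for training, 30% held out.
outer = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, temp_idx = next(outer.split(toy, groups=toy["idhogar"]))
train_df, temp_df = toy.iloc[train_idx], toy.iloc[temp_idx]

# Second split: one third of the held-out households for validation, two thirds for test.
inner = GroupShuffleSplit(n_splits=1, test_size=2 / 3, random_state=42)
val_idx, test_idx = next(inner.split(temp_df, groups=temp_df["idhogar"]))
val_df, test_df = temp_df.iloc[val_idx], temp_df.iloc[test_idx]

print(len(train_df), len(val_df), len(test_df))
```

This does not stratify by `Target`; if the class imbalance seen earlier turns out to matter, the class proportions of each split should be checked after splitting.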
