# Week 2 Tasks — Data Preparation, EDA, and Intro Modeling (Python)

Use this template to complete Week 2 tasks. Replace placeholders with your work. Ensure the notebook runs top-to-bottom without errors. Add short captions/annotations below each plot and metric output.


In [None]:
# Setup
import sys, warnings
# Hides ALL warnings, including deprecation notices; comment this out while debugging
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, mean_squared_error, r2_score

sns.set_theme(style='whitegrid')
%matplotlib inline


## Task 1 — Load Data and Inspect


In [None]:
# TODO: Set the path and read your dataset
# Example:
# df = pd.read_csv('path/to/your.csv', parse_dates=['date_col'])
# display(df.head(10))  # display() shows output even when it is not the last line of the cell
# print(df.shape)
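
As a minimal sketch of the load-and-inspect step, the cell below reads a small CSV from an in-memory string so it runs without a file on disk. The column names (`order_date`, `region`, `amount`) and the values are made up for illustration; with a real dataset you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV contents, inlined so the sketch runs without a file on disk.
csv_text = """order_date,region,amount
2024-01-03,north,120.5
2024-01-04,south,80.0
2024-01-04,north,
"""

# For a real file, pass its path in place of the StringIO object.
df_demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date"])

print(df_demo.head(10))  # first rows (at most 10)
print(df_demo.shape)     # (n_rows, n_columns)
df_demo.info()           # dtypes and non-null counts per column
```

`info()` is a quick way to spot columns that were read with an unexpected dtype or that contain missing values, before moving on to Task 2.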


Briefly describe the dataset, its purpose, and key variables.


## Task 2 — Data Types, Summary Stats, and Missingness


In [None]:
# Column dtypes (correct them if needed)
# display(df.dtypes)
# Example: df['date_col'] = pd.to_datetime(df['date_col'], errors='coerce')

# Summary statistics
# display(df.describe(include='all').T)

# Missingness: count and percentage of missing values per column
# display(df.isna().sum().to_frame('n_missing')
#           .assign(pct_missing=lambda d: 100 * d['n_missing'] / len(df)))
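
The dtype correction and missingness table can be sketched on a toy frame as follows. The data and column names (`signup`, `age`, `plan`) are invented for illustration. Note that `errors='coerce'` converts unparseable dates to `NaT`, so they show up as missing afterwards.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
df_demo = pd.DataFrame({
    "signup": ["2024-01-05", "not a date", "2024-02-10", None],
    "age": [34, np.nan, 29, 41],
    "plan": ["basic", "pro", None, "basic"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
df_demo["signup"] = pd.to_datetime(df_demo["signup"], errors="coerce")

# One row per column: number and percentage of missing values, worst first.
missing = (
    df_demo.isna().sum().to_frame("n_missing")
    .assign(pct_missing=lambda d: 100 * d["n_missing"] / len(df_demo))
    .sort_values("n_missing", ascending=False)
)
print(df_demo.dtypes)
print(missing)
```

Running the missingness table after dtype corrections, rather than before, also counts values that only became missing through coercion.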


## Task 3 — Data Cleaning


In [None]:
# TODO: Handle missing values, duplicates, and standardize column names
# df = df.drop_duplicates()
# df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
# Example imputation: df['num_col'] = df['num_col'].fillna(df['num_col'].median())
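
A possible cleaning sequence, sketched on invented data: standardize the column names first, then drop exact duplicate rows, then impute. The frame `raw` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up raw data with untidy column names, a duplicate row, and a missing value.
raw = pd.DataFrame({
    " Customer ID ": [1, 2, 2, 3],
    "Order Total": [10.0, np.nan, np.nan, 30.0],
})

clean = raw.copy()

# 1. Standardize column names: trim, lowercase, spaces to underscores.
clean.columns = clean.columns.str.strip().str.lower().str.replace(" ", "_")

# 2. Drop exact duplicate rows (missing values in the same position count as equal).
clean = clean.drop_duplicates().reset_index(drop=True)

# 3. Impute remaining missing numeric values with the column median.
clean["order_total"] = clean["order_total"].fillna(clean["order_total"].median())

print(f"rows before: {len(raw)}, rows after: {len(clean)}")
print(clean)
```

If the imputed column later feeds a model, computing the median on the training split only avoids leaking test-set information into the features.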


## Task 4 — Exploratory Data Analysis (EDA)


In [None]:
# Example univariate plot
# sns.histplot(df['numeric_col'], bins=30, kde=True)
# plt.title('Distribution of numeric_col')
# plt.show()

# Example bivariate plot
# sns.regplot(data=df, x='xvar', y='yvar', scatter_kws={'alpha':0.6})
# plt.title('xvar vs yvar')
# plt.show()
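
Numeric summaries can back up what the plots suggest and give concrete numbers to quote in the observations. The sketch below uses synthetic data (the column names mirror the placeholders above, and the linear relationship is built in on purpose), so the values themselves carry no meaning.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration: yvar is constructed to depend linearly on xvar.
rng = np.random.default_rng(0)
n = 200
xvar = rng.normal(50, 10, n)
df_demo = pd.DataFrame({
    "xvar": xvar,
    "yvar": 2.0 * xvar + rng.normal(0, 5, n),
    "category": rng.choice(["a", "b", "c"], n),
})

# Univariate: location, spread, and asymmetry of one numeric column.
print(df_demo["xvar"].describe())
print("skew:", df_demo["xvar"].skew())

# Bivariate: linear association between two numeric columns.
corr = df_demo["xvar"].corr(df_demo["yvar"])
print("Pearson r:", round(corr, 3))

# Numeric by category: group sizes and central tendency per level.
by_cat = df_demo.groupby("category")["yvar"].agg(["count", "mean", "median"])
print(by_cat)
```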


Write 3–5 short observations from EDA here.


## Task 5 — Data Visualization


In [None]:
# Create at least 2 clear plots with captions below
# sns.boxplot(data=df, x='category', y='numeric_col')
# plt.title('numeric_col by category')
# plt.show()
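
One way to keep plot and caption together is to compute the number the caption quotes in the same cell. The sketch below uses plain matplotlib rather than seaborn so that it has fewer dependencies, and the data are synthetic; `caption` is just a printed string, not a matplotlib feature.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data for illustration: three categories with different centers.
rng = np.random.default_rng(1)
df_demo = pd.DataFrame({
    "category": np.repeat(["a", "b", "c"], 50),
    "numeric_col": np.concatenate([rng.normal(m, 1.0, 50) for m in (0.0, 1.0, 3.0)]),
})

# One array of values per category, in sorted category order.
groups = {name: g["numeric_col"].to_numpy() for name, g in df_demo.groupby("category")}

fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(list(groups.values()))
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()))
ax.set_xlabel("category")
ax.set_ylabel("numeric_col")
ax.set_title("numeric_col by category")
fig.tight_layout()
plt.show()

# Caption printed below the plot, backed by a computed value.
medians = df_demo.groupby("category")["numeric_col"].median()
caption = "Caption: median numeric_col is highest in category " + medians.idxmax()
print(caption)
```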


Key takeaway:


## Task 6 — Class Imbalance (if applicable)


In [None]:
# Inspect class distribution if you have a classification target
# df['target'].value_counts(normalize=True)
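
To illustrate why the class distribution matters for the split in Task 8, the sketch below builds an artificial 90/10 target and shows that `stratify=y` keeps the same proportions in the test set. The target and the single `feature` column are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Artificial imbalanced target: 90 negatives, 10 positives.
y = pd.Series([0] * 90 + [1] * 10, name="target")
X = pd.DataFrame({"feature": range(len(y))})

counts = y.value_counts()
shares = y.value_counts(normalize=True)
imbalance_ratio = counts.max() / counts.min()  # majority size / minority size
print(shares)
print("imbalance ratio:", imbalance_ratio)

# stratify=y preserves the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_test.value_counts(normalize=True))
```

With a ratio this skewed, accuracy alone is misleading (always predicting the majority class already scores 0.9), which is one reason to report F1 or AUROC in Task 9.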


## Task 7 — Feature Engineering


In [None]:
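# Example: create date parts, text length, and binned numeric variables
# (one-hot encoding of categories is handled by OneHotEncoder in the Task 8 pipeline)
# df['month'] = df['date_col'].dt.month
# df['desc_len'] = df['text_col'].str.len()
# df['num_bin'] = pd.qcut(df['num_col'], q=4, labels=False)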
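
The feature types named above can be sketched together on a toy frame. All column names and values are invented, and the bin edges are arbitrary. `pd.get_dummies` is used here for a quick look at one-hot columns; inside a modeling pipeline, `OneHotEncoder` is the usual alternative because it can handle categories unseen during training.

```python
import pandas as pd

# Made-up data for illustration only.
df_demo = pd.DataFrame({
    "date_col": pd.to_datetime(["2024-01-15", "2024-06-30", "2024-12-01"]),
    "text_col": ["short", "a longer description", None],
    "cat1": ["red", "blue", "red"],
    "num1": [5.0, 42.0, 97.0],
})

feats = df_demo.copy()

# Date parts
feats["month"] = feats["date_col"].dt.month
feats["dayofweek"] = feats["date_col"].dt.dayofweek  # Monday = 0

# Text length; missing text counts as length 0
feats["desc_len"] = feats["text_col"].str.len().fillna(0).astype(int)

# Bin a numeric variable into labeled ranges (edges chosen arbitrarily here)
feats["num1_bin"] = pd.cut(feats["num1"], bins=[0, 33, 66, 100], labels=["low", "mid", "high"])

# One-hot encode a categorical column
feats = pd.get_dummies(feats, columns=["cat1"], prefix="cat1")

print(feats.drop(columns=["date_col", "text_col"]))
```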


## Task 8 — Baseline Modeling


In [None]:
# Example baseline for classification (edit for your dataset)
# target = 'target'
# numeric = ['num1','num2']
# categorical = ['cat1','cat2']
# X = df[numeric + categorical]
# y = df[target]
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# preproc = ColumnTransformer(
#     [('cat', OneHotEncoder(handle_unknown='ignore'), categorical)],
#     remainder='passthrough'
# )
# clf = Pipeline([('prep', preproc), ('model', LogisticRegression(max_iter=1000, class_weight='balanced'))])
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)
# y_proba = clf.predict_proba(X_test) if hasattr(clf, 'predict_proba') else None
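
For reference, here is a runnable version of the baseline on synthetic data. The frame, the feature names, and the rule that generates `target` are all invented for illustration. Unlike the template above, this sketch scales the numeric columns with `StandardScaler` instead of passing them through; that is a swapped-in choice, not a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: the target depends on num1 plus noise; num2 and cat1 are uninformative.
rng = np.random.default_rng(42)
n = 400
df_demo = pd.DataFrame({
    "num1": rng.normal(0, 1, n),
    "num2": rng.normal(5, 2, n),
    "cat1": rng.choice(["a", "b", "c"], n),
})
df_demo["target"] = (df_demo["num1"] + rng.normal(0, 0.5, n) > 0).astype(int)

numeric = ["num1", "num2"]
categorical = ["cat1"]
X = df_demo[numeric + categorical]
y = df_demo["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing lives inside the pipeline, so it is fit on the training split only.
preproc = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
clf = Pipeline([
    ("prep", preproc),
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)  # shape (n_samples, n_classes)
test_accuracy = clf.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

Keeping the encoder and scaler inside the `Pipeline` means `clf.predict` can be called directly on raw test rows, with the same transformations applied automatically.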


## Task 9 — Evaluation


In [None]:
# Example metrics (classification)
# print('Accuracy:', accuracy_score(y_test, y_pred))
# print('F1:', f1_score(y_test, y_pred, average='weighted'))
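# if y_proba is not None:
#     try:
#         # y_proba[:, 1] assumes a binary target (probability of the positive class)
#         print('AUROC:', roc_auc_score(y_test, y_proba[:, 1]))
#     except ValueError as err:
#         print('AUROC not computed:', err)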

# Example metrics (regression)
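# `reg` is a fitted regressor, e.g. a Pipeline ending in LinearRegression()
# preds = reg.predict(X_test)
# RMSE via np.sqrt: the `squared` argument was deprecated in scikit-learn 1.4 and removed in later releases
# print('RMSE:', np.sqrt(mean_squared_error(y_test, preds)))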
# print('R^2:', r2_score(y_test, preds))
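
The metric calls can be checked on tiny hand-made arrays before applying them to real predictions. The labels, scores, and regression values below are invented so that each metric is easy to verify by hand.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)

# Classification: true labels, hard predictions, and positive-class scores (made up).
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
y_score = np.array([0.1, 0.6, 0.8, 0.7, 0.4, 0.2])

acc = accuracy_score(y_true, y_pred)   # share of correct hard predictions
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall (positive class)
auc = roc_auc_score(y_true, y_score)   # ranking quality of the scores, threshold-free
print(f"Accuracy: {acc:.3f}  F1: {f1:.3f}  AUROC: {auc:.3f}")

# Regression: true values and predictions (made up).
y_true_reg = np.array([3.0, 5.0, 7.0])
preds_reg = np.array([2.0, 5.0, 9.0])

rmse = np.sqrt(mean_squared_error(y_true_reg, preds_reg))  # same units as the target
r2 = r2_score(y_true_reg, preds_reg)                       # 1 - SS_res / SS_tot
print(f"RMSE: {rmse:.3f}  R^2: {r2:.3f}")
```

Accuracy and F1 are computed from hard predictions, while AUROC needs scores or probabilities; mixing these inputs up is a common source of misleading numbers.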


Provide a 2–4 sentence interpretation of the metrics and what they imply.


## Task 10 — Findings and Next Steps


Summarize 3–5 insights from EDA and modeling. Propose 2 concrete next steps to improve the analysis or model.
