<a id="index"></a>

# IPA Assessment

Participant: Pablo Cruz Lemini

Create Summary Statistics:
- [Column 1 table](#column-1-table)
- [Column 2 and 3 from the table](#column-23-table)

Create Graphs: 
- [Difference in amount saved at school at endline](#amount-saved-graph)
- [Difference in amount saved between control group and treatment](#amount-difference-graph)

Models:
- [Create a regression model](#regression-model)
- [Describe how you would build a machine learning model](#ml-model)

Ideas:
- [How to provide an ongoing monitoring system of Aflatoun savings](#advice)

### Import libraries & Data

In [1]:
# Imports
import pandas as pd
import plotly.graph_objects as go


In [2]:
# Display full data frame:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [3]:
# Load CSV file
df = pd.read_csv('ghana_youth_savings.csv')

In [4]:
# Explore Dataframe:
df.head(5)

Unnamed: 0,uuid,schid,noattrit,aflatoun,hmb,class,std_avg_class_size,region1,region2,region3,urban,gender,female,age,tribe,hhsize,hhearn,hhbankaccounts,repeated_grade,baseline_save,baseline_saveamt,takeupschool,takeup,end_save,end_saveamt,end_saveschl,end_saveschlamt,asset_index,work_index,riskpref_index,timepref_index,treatment,saving_attitude_index,saving_env_index,fin_lit_index,tempt_index,expend_index,academic_index
0,1011,101,1,1,0,P5,-0.06316,1,0,0,0,Male,0.0,10.0,Akyem,6.0,1.0,0,No,Yes,2.3,0,0,No,0.0,No,0.0,-0.460591,-0.592948,-0.4924,0.790053,1,0.024145,0.222919,0.442896,-0.650386,-0.138709,-0.01253
1,1012,101,1,1,0,P5,-0.06316,1,0,0,0,Male,0.0,12.0,Other,7.0,2.0,0,No,No,0.0,0,0,Yes,1.4,No,0.0,-0.338644,-0.592948,2.029718,-1.265019,1,0.024145,-0.222388,1.256745,0.114838,-0.058629,-0.765385
2,1013,101,1,1,0,P5,-0.06316,1,0,0,0,Male,0.0,13.0,Fanti,3.0,1.0,0,No,No,0.0,0,0,No,0.0,No,0.0,-1.229701,3.404489,-0.4924,0.790053,1,0.024145,-0.667696,-1.069514,-0.650386,1.871038,-1.827975
3,1014,101,1,1,0,P5,-0.06316,1,0,0,0,0,1.0,10.0,Fanti,6.0,2.0,0,No,No,0.0,0,0,No,0.0,No,0.0,0.280608,-0.592948,-0.4924,0.790053,1,-1.868931,-0.833853,0.375475,-0.650386,1.900146,-0.01253
4,1015,101,1,1,0,P5,-0.06316,1,0,0,0,0,1.0,13.0,Fanti,7.0,2.0,1,No,No,0.0,0,0,No,0.0,No,0.0,0.494285,-0.592948,-0.4924,0.790053,1,0.024145,-0.010213,-1.307112,-0.405488,-0.29287,-3.023953


In [5]:
# Clean up data

# Change baseline_save responses from string to integer
df['baseline_save'] = df['baseline_save'].replace('No', 0)
df['baseline_save'] = df['baseline_save'].replace('Yes', 1)

# Change noattrit 'No' responses to 0
df['noattrit'] = df['noattrit'].replace('No', 0)

# Create Control Group dataframe
control_group = df[df['treatment'] == 0]

# Create Aflatoun Group dataframe
aflatoun_group = df[df['aflatoun'] == 1]

# Create Honesty Money Box Group dataframe
honesty_money_box_group = df[df['hmb'] == 1]
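
The Yes/No recoding above could also be written with an explicit lookup table, which keeps the two codes in one place and leaves missing answers as NaN. This is a minimal sketch on a toy frame (`toy` and `yes_no` are hypothetical names, not part of the assessment data):

```python
import pandas as pd

# Toy stand-in for the survey column (synthetic values, not the study data)
toy = pd.DataFrame({"baseline_save": ["No", "Yes", "Yes", None]})

# Explicit lookup table: any value not listed here becomes NaN
yes_no = {"No": 0, "Yes": 1}
toy["baseline_save"] = toy["baseline_save"].map(yes_no)

# The mean now reads as the share of students who reported savings,
# computed over the non-missing answers only
share_with_savings = toy["baseline_save"].mean()
print(toy)
print(share_with_savings)
```

Because `map` turns unlisted values into NaN, unexpected answer codes show up as missing values instead of staying in the column as strings.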



### Create Summary Statistics Table

In [6]:
# Function to create the summary table

def create_summary_table(df, group_name):

    # Select columns to aggregate
    columns_to_aggregate = [
        'female',
        'age',
        'baseline_save',
        'baseline_saveamt',
        'saving_attitude_index',
        'saving_env_index',
        'work_index',
        'riskpref_index',
        'timepref_index',
        'fin_lit_index',
        'tempt_index',
        'expend_index',
        'academic_index',
        'noattrit',
        'end_saveamt',
    ]

    # Calculate mean and standard deviation
    summary_table = df[columns_to_aggregate].agg(['mean', 'std'])

    # Rename columns to match target table format
    summary_table.rename(columns={
        'female': 'Female',
        'age': 'Age',
        'baseline_save': 'Student Has Money Saved',
        'baseline_saveamt': 'Amount Saved',
        'saving_attitude_index': 'Saving Attitudes Index',
        'saving_env_index': 'Home Savings Support Index',
        'work_index': 'Work Index',
        'riskpref_index': 'Risk Preference Index',
        'timepref_index': 'Time Preference Index',
        'fin_lit_index': 'Financial Literacy Index',
        'tempt_index': 'Expenditures on Temptation Goods Index',
        'expend_index': 'Expenditures on Self Index',
        'academic_index': 'Academic Index',
        'noattrit': 'Completed Endline Survey',
        'end_saveamt': 'Amount Saved at Endline',
    }, inplace=True)

    # Round values to 3 decimal places
    summary_table_rounded = summary_table.round(3).transpose()
    
    # Add column header with group name
    summary_table_rounded.columns = pd.MultiIndex.from_product([[group_name], ['mean', 'std']])
    
    # Return the complete summary table
    return summary_table_rounded


<a id="column-1-table"></a>
### Column 1 Table [*](#index)

In [7]:
control_group_summary = create_summary_table(control_group, "Control Group")
control_group_summary

| Variable | Control Group (mean) | Control Group (std) |
|---|---|---|
| Female | 0.5 | 0.5 |
| Age | 12.806 | 1.989 |
| Student Has Money Saved | 0.467 | 0.499 |
| Amount Saved | 5.041 | 17.966 |
| Saving Attitudes Index | -0.0 | 1.0 |
| Home Savings Support Index | -0.0 | 1.0 |
| Work Index | -0.0 | 1.0 |
| Risk Preference Index | -0.0 | 1.0 |
| Time Preference Index | -0.0 | 1.0 |
| Financial Literacy Index | 0.0 | 1.0 |


<a id="column-23-table"></a>
### Column 2 & 3 Table [*](#index)

In [8]:
# Create summary tables of test groups

# Aflatoun Group
aflatoun_group_summary = create_summary_table(aflatoun_group, "Aflatoun")

# Honesty Money Box Group
honesty_money_box_group_summary = create_summary_table(honesty_money_box_group, "HMB")

# Concatenate Control, Aflatoun, and Honesty Money Box Group Tables
summary_table_concatenated = pd.concat([control_group_summary, aflatoun_group_summary, honesty_money_box_group_summary], axis=1)

# Copy the table to create the graph
summary_table_concatenated_graph = summary_table_concatenated.copy()

#summary_table_concatenated

In [9]:
# Calculate the difference between the Aflatoun and Control groups: means and standard deviations
# Save the results in new columns
summary_table_concatenated[('Difference Aflatoun', 'mean')] = summary_table_concatenated[('Aflatoun', 'mean')] - summary_table_concatenated[('Control Group', 'mean')]
summary_table_concatenated[('Difference Aflatoun', 'std')] = summary_table_concatenated[('Aflatoun', 'std')] - summary_table_concatenated[('Control Group', 'std')]

# Calculate the difference between the Honesty Money Box and Control groups: means and standard deviations
# Save the results in new columns
summary_table_concatenated[('Difference HMB', 'mean')] = summary_table_concatenated[('HMB', 'mean')] - summary_table_concatenated[('Control Group', 'mean')]
summary_table_concatenated[('Difference HMB', 'std')] = summary_table_concatenated[('HMB', 'std')] - summary_table_concatenated[('Control Group', 'std')]

#summary_table_concatenated

In [10]:
# Drop columns with group Aflatoun and HMB
summary_table_concatenated = summary_table_concatenated.drop(columns=['Aflatoun', 'HMB'])

# Display final summary table
summary_table_concatenated



| Variable | Control Group (mean) | Control Group (std) | Difference Aflatoun (mean) | Difference Aflatoun (std) | Difference HMB (mean) | Difference HMB (std) |
|---|---|---|---|---|---|---|
| Female | 0.5 | 0.5 | -0.003 | 0.0 | -0.02 | 0.0 |
| Age | 12.806 | 1.989 | 0.24 | 0.099 | -0.068 | 0.253 |
| Student Has Money Saved | 0.467 | 0.499 | -0.04 | -0.004 | -0.006 | 0.0 |
| Amount Saved | 5.041 | 17.966 | -0.917 | -3.385 | -0.286 | 9.86 |
| Saving Attitudes Index | -0.0 | 1.0 | 0.075 | -0.041 | 0.004 | -0.017 |
| Home Savings Support Index | -0.0 | 1.0 | -0.015 | 0.005 | 0.024 | -0.04 |
| Work Index | -0.0 | 1.0 | -0.102 | -0.082 | 0.017 | -0.027 |
| Risk Preference Index | -0.0 | 1.0 | 0.037 | 0.028 | 0.072 | 0.051 |
| Time Preference Index | -0.0 | 1.0 | 0.013 | -0.003 | 0.008 | -0.002 |
| Financial Literacy Index | 0.0 | 1.0 | 0.053 | -0.143 | 0.016 | -0.094 |
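
The "std" difference columns above subtract one group's standard deviation from another's, which does not describe the uncertainty of the difference in means. One common alternative is the standard error of that difference. The sketch below computes it on synthetic values; `mean_difference` is a hypothetical helper, and the unequal-variance (Welch) formula is an assumption about which estimator one would want:

```python
import math

import pandas as pd

# Synthetic stand-ins for one variable in two groups (not the study data)
control = pd.Series([0.0, 2.0, 4.0, 6.0, 8.0])
treated = pd.Series([1.0, 3.0, 5.0, 7.0, 9.0])


def mean_difference(treated, control):
    """Return (difference in means, standard error, t statistic).

    Uses the unequal-variance (Welch) standard error:
    sqrt(var_t / n_t + var_c / n_c), with sample variances (ddof=1).
    """
    diff = treated.mean() - control.mean()
    se = math.sqrt(
        treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control)
    )
    return diff, se, diff / se


diff, se, t_stat = mean_difference(treated, control)
print(diff, se, t_stat)
```

In a clustered design such as a school-level randomization, standard errors would normally also need to account for clustering, which this sketch does not attempt.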


### The 4-hour time limit was reached here

#### I continued a bit more... 

<a id="amount-saved-graph"></a>
### Graph: Difference in amount saved at school at endline [*](#index)

In [11]:
# Keep only relevant rows from the table
rows_to_keep = ['Amount Saved', 'Amount Saved at Endline']
summary_table_concatenated_graph = summary_table_concatenated_graph.loc[rows_to_keep]

# Transpose the table so the variables are displayed as columns
summary_table_concatenated_graph = summary_table_concatenated_graph.transpose()

# Display the table
summary_table_concatenated_graph


| Group | Statistic | Amount Saved | Amount Saved at Endline |
|---|---|---|---|
| Control Group | mean | 5.041 | 9.121 |
| Control Group | std | 17.966 | 38.553 |
| Aflatoun | mean | 4.124 | 8.441 |
| Aflatoun | std | 14.581 | 31.39 |
| HMB | mean | 4.755 | 7.917 |
| HMB | std | 27.826 | 33.827 |


In [12]:

# Extract mean values

# Control Group
control_mean_start = summary_table_concatenated_graph.loc[('Control Group', 'mean'), 'Amount Saved']
control_mean_endline = summary_table_concatenated_graph.loc[('Control Group', 'mean'), 'Amount Saved at Endline']

# Aflatoun Group
aflatoun_mean_start = summary_table_concatenated_graph.loc[('Aflatoun', 'mean'), 'Amount Saved']
aflatoun_mean_endline = summary_table_concatenated_graph.loc[('Aflatoun', 'mean'), 'Amount Saved at Endline']

# Honesty Money Box Group
hmb_mean_start = summary_table_concatenated_graph.loc[('HMB', 'mean'), 'Amount Saved']
hmb_mean_endline = summary_table_concatenated_graph.loc[('HMB', 'mean'), 'Amount Saved at Endline']

# Create bar graph
fig = go.Figure(data=[
    go.Bar(name='Amount Saved at Start', x=['Control Group', 'Aflatoun', 'HMB'], y=[control_mean_start, aflatoun_mean_start, hmb_mean_start]),
    go.Bar(name='Amount Saved at End', x=['Control Group', 'Aflatoun', 'HMB'], y=[control_mean_endline, aflatoun_mean_endline, hmb_mean_endline])
])

# Update layout
fig.update_layout(
    title='Amount Saved at Start and Endline by Group',
    xaxis_title='Group',
    yaxis_title='Amount Saved',
    barmode='group'
)

# Show graph
fig.show()

<a id="amount-difference-graph"></a>
### Graph: Difference in amount saved between control group and treatment [*](#index)

In [13]:
# Calculate Difference in Amount Saved at Start and Endline
control_amount_saved = control_mean_endline - control_mean_start
aflatoun_amount_saved = aflatoun_mean_endline - aflatoun_mean_start
hmb_amount_saved = hmb_mean_endline - hmb_mean_start

# Create bar graph
fig = go.Figure(data=[
    go.Bar(name='Change in Amount Saved', x=['Control Group', 'Aflatoun', 'HMB'], y=[control_amount_saved, aflatoun_amount_saved, hmb_amount_saved]),
])

fig.update_layout(
    title='Total Amount Saved by Group',
    xaxis_title='Group',
    yaxis_title='Amount Saved (Ghanaian Cedis)',
    yaxis=dict(range=[0, 15]),
    annotations=[
        dict(
            x=0, # 0 is left, 1 is right
            y=1, # 0 is bottom, 1 is top
            xref='paper', # means plotting area
            yref='paper', # means plotting area
            text='1 USD = 15 Ghanaian Cedis',
            xanchor='left',
            yanchor='bottom',
            showarrow=False,
            font=dict(size=12)
        )
    ],
    barmode='group'
)

# Show graph
fig.show()

<a id="regression-model"></a>
### Model: Create a regression model [*](#index)

Example of a regression model using the scikit-learn library

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [15]:
# Load the dataset
data = pd.read_csv('ghana_youth_savings.csv')


In [16]:
# Determine relevant columns
X = data[['aflatoun', 'hmb', 'treatment', 'baseline_saveamt', 'age', 'female']]  # Independent variables
y = data['end_saveamt']  # Dependent variable

# Handle missing values if necessary
X = X.fillna(0)
y = y.fillna(0)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [17]:
# Initialize the model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Print the coefficients
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

Coefficients: [-0.51377938 -0.20379203 -0.71757141  0.32551226  1.64705093 -0.86006079]
Intercept: -12.937064336617532
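
In the printed coefficients, the third value (`treatment`, -0.7176) equals the sum of the first two (`aflatoun` and `hmb`, -0.5138 and -0.2038). That pattern is consistent with `treatment` being the sum of the two arm indicators, in which case the three columns are perfectly collinear and their individual coefficients are not separately identified. A minimal check, on synthetic data and assuming `treatment == aflatoun + hmb`, could look like this:

```python
import numpy as np
import pandas as pd

# Synthetic design matrix (not the study data)
X_toy = pd.DataFrame({
    "aflatoun": [1, 1, 0, 0, 0, 0],
    "hmb": [0, 0, 1, 1, 0, 0],
    "age": [10, 12, 13, 11, 12, 14],
})
# Assumption: the pooled indicator is the sum of the two arm indicators
X_toy["treatment"] = X_toy["aflatoun"] + X_toy["hmb"]

# A rank below the number of columns signals perfect collinearity
rank_with = np.linalg.matrix_rank(X_toy.to_numpy(dtype=float))
rank_without = np.linalg.matrix_rank(
    X_toy.drop(columns="treatment").to_numpy(dtype=float)
)

print("columns:", X_toy.shape[1], "rank:", rank_with)
print("columns without treatment:", X_toy.shape[1] - 1, "rank:", rank_without)
```

If the same check on the real feature matrix shows a rank deficit, dropping `treatment` (or the two arm indicators) would leave each remaining coefficient interpretable on its own.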


In [18]:
# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R2):", r2)

Mean Squared Error (MSE): 423.50194981304026
R-squared (R2): 0.06670807241909538


Interpretation of Results (AI Generated)

- Mean Squared Error (MSE): 423.50
    - MSE is the average squared difference between the predicted and actual values; a lower MSE indicates better performance.
    - MSE is expressed in squared units of end_saveamt, so it is easier to judge as a root mean squared error: √423.50 ≈ 20.6. A typical prediction error of about 20.6 is large next to the group means of 7.9 to 9.1 for Amount Saved at Endline in the table above, although endline savings are themselves widely dispersed (standard deviations of 31.4 to 38.6).
- R-squared (R²): 0.0667
    - R² is the share of the variance in the dependent variable (end_saveamt) that the independent variables (aflatoun, hmb, etc.) explain.
    - A value of 0.0667 means the model explains only about 6.7% of the variance in end_saveamt on the test set.
    - This is low and suggests that factors not included in the model drive most of the variation in end_saveamt.
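
How MSE, RMSE, and R² relate to each other can be sketched on a few toy numbers. The values below are made up for illustration; R² is computed as one minus the ratio of the model's MSE to the MSE of always predicting the mean, which is the definition scikit-learn's `r2_score` uses:

```python
import math

# Toy actual and predicted values (synthetic, for illustration only)
y_true = [0.0, 2.0, 4.0, 6.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
n = len(y_true)

# Mean squared error of the model's predictions
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# RMSE is on the same scale as the target, so it is easier to interpret
rmse = math.sqrt(mse)

# Baseline: always predict the mean of the observed values
mean_y = sum(y_true) / n
baseline_mse = sum((t - mean_y) ** 2 for t in y_true) / n

# R-squared: share of the baseline error that the model removes
r2 = 1 - mse / baseline_mse

print(mse, rmse, r2)
```

An R² close to 0 therefore means the model does little better than predicting the mean for every student.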

<a id="ml-model"></a>
### Model: Describe how you would build a machine learning model [*](#index)


##### Prepare data:

- Encode categorical variables (e.g., gender, tribe, end_save) as numeric indicator columns (one 0/1 column per category)
- Handle missing values
- Normalize or scale numerical features for regression analysis.

##### Split data:

- Training set
- Validation set
- Test set

##### Select appropriate model based on problem

- Neural network for classification 
- Neural network for regression

##### Define model parameters

- Loss function to quantify prediction errors
- Optimizer to adjust the model's weights

##### Train model over multiple iterations

##### Evaluate the model performance

- Using unseen data
- Adjust parameters

##### Save for deployment
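
The steps above can be sketched with scikit-learn on synthetic data. This is one possible arrangement, not a tuned model: the column names mirror the dataset, but the values, the 60/20/20 split, and the network size are all assumptions made for illustration. `MLPRegressor` uses squared error as its loss and Adam as its default optimizer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (not the study data), with a few missing ages
rng = np.random.default_rng(0)
n = 200
features = pd.DataFrame({
    "gender": rng.choice(["Male", "Female"], size=n),
    "age": rng.integers(9, 17, size=n).astype(float),
    "baseline_saveamt": rng.exponential(5.0, size=n),
})
features.loc[::25, "age"] = np.nan
target = 0.5 * features["baseline_saveamt"] + rng.normal(0.0, 1.0, size=n)

# Split data: 60% training, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    features, target, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

# Prepare data: impute and scale numeric columns, one-hot encode categoricals
numeric_cols = ["age", "baseline_saveamt"]
categorical_cols = ["gender"]
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Select and train the model: a small neural network for regression
ml_model = Pipeline([
    ("preprocess", preprocess),
    ("net", MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=0)),
])
ml_model.fit(X_train, y_train)

# Evaluate: tune against the validation set, report once on the test set
val_r2 = ml_model.score(X_val, y_val)
test_pred = ml_model.predict(X_test)
print("Validation R2:", val_r2)
```

Keeping the preprocessing inside the pipeline means the imputation and scaling statistics are learned from the training set only, so the validation and test scores are not contaminated by information from those sets.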


<a id="advice"></a>
### How to provide an ongoing monitoring system of Aflatoun savings [*](#index)


Q: How would you recommend setting up the data collection and data engineering
to create an ongoing monitoring system for Aflatoun’s savings clubs? Assume that Aflatoun wants weekly
reporting on savings groups participation and savings amounts.

A: Aflatoun could build a web or mobile application for teachers to use in schools, provided that internet access and computers or phones are available. The application could include templates for recording each student's savings deposits, let schools compare their progress with that of other schools, and report the data back to Aflatoun so that it can use the feedback to improve its curriculum. Where this infrastructure is not available, standardized paper forms (for example, checkbox sheets readable by Optical Mark Recognition) could be filled in by hand and collected by someone visiting the schools on a periodic basis.
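
On the data engineering side, the weekly report could be a simple aggregation over a deposit log. The sketch below assumes a hypothetical log with one row per deposit (`school_id`, `student_id`, `date`, `amount`); the schema and values are invented for illustration:

```python
import pandas as pd

# Hypothetical deposit log: one row per student deposit (synthetic values)
deposits = pd.DataFrame({
    "school_id": [101, 101, 101, 102, 102],
    "student_id": [1011, 1012, 1011, 1021, 1021],
    "date": pd.to_datetime([
        "2024-01-01", "2024-01-03", "2024-01-08", "2024-01-02", "2024-01-09",
    ]),
    "amount": [1.0, 2.5, 0.5, 3.0, 1.5],
})

# Weekly report per school: number of distinct savers and total amount saved
weekly_report = (
    deposits
    .groupby(["school_id", pd.Grouper(key="date", freq="W")])
    .agg(participants=("student_id", "nunique"), total_saved=("amount", "sum"))
    .reset_index()
    .rename(columns={"date": "week_ending"})
)
print(weekly_report)
```

The same aggregation could run on a schedule against whatever store the app or the digitized paper forms feed into, so the report format stays the same regardless of how the data was collected.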