Kiva ETL

Background Information

Data Source

Kaggle: https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding

Data Files:

One "kiva_loans.csv" file that contains information on several loans on the Kiva platform
One "kiva_mpi_region_locations.csv" file that contains geographical information for the borrower as well as the MPI (Global Multidimensional Poverty Index) for the borrower's location
One "loan_theme_ids.csv" file that contains information on each Kiva loan "theme" or the general purpose behind each loan
One "loan_themes_by_region.csv" file that contains each lender's (field partner's) information

Extract

CSV files are taken from a Kaggle web page and saved in the "Resources" folder
Data is extracted from the four CSV files in the "Resources" folder
Python code to extract data from the CSV files is listed below:

kiva_loans.csv

kiva_loans_csv = "Resources/kiva_loans.csv"
kiva_loans_df = pd.read_csv(kiva_loans_csv)

kiva_mpi_region_locations.csv

regions_csv = "Resources/kiva_mpi_region_locations.csv"
regions_df = pd.read_csv(regions_csv)

loan_theme_ids.csv

theme_csv = "Resources/loan_theme_ids.csv"
theme_df = pd.read_csv(theme_csv)

loan_themes_by_region.csv

field_partner_csv = "Resources/loan_themes_by_region.csv"
field_partner_df = pd.read_csv(field_partner_csv)

Transform

Python is used to create tables using only the relevant columns:

Loans table:

  cleaned_kiva_loans_df = loans_themes_df[["loan_id","funded_amount","loan_amount","activity","loan_sector","purpose","currency","partner_id","posted_time","disbursed_time","funded_time","term_in_months","lender_count","repayment_interval","date","num_fem_borrowers","num_m_borrowers","total_borrowers","location_id","theme_id"]].copy()

Regions table:

  cleaned_regions_df = regions_df[["location_id","iso","country","region","world_region","mpi","lat","lon"]].copy()
  cleaned_regions_df.drop_duplicates(["location_id"],keep="first",inplace=True)

Loan theme table:

  cleaned_theme_df = cleaned_theme_df[["theme_id","theme_type"]].copy()

Field partners table:

  cleaned_field_partner_df = loan_partner_df[["partner_id","field_partner_name","partner_sector"]].copy()
  cleaned_field_partner_df.drop_duplicates(subset=["partner_id"],keep="first",inplace=True)
  cleaned_field_partner_df.dropna(subset=["partner_id"],how="any",inplace=True)

Cleaning

All four dataframes are cleaned using the following commands:
1. "drop_duplicates" to remove all duplicate entries in each dataframe
2. "to_datetime" to convert columns to datetime format that are initially registered as strings
3. "rename" to change column titles so that all column titles are lower case, contain zero spaces, and are overall clearly written
4. "dropna" to drop rows that have N/As in critical columns (e.g. if rows do not have information listed in their primary key figures)
5. "merge" to establish foreign key figures in tables where there is not a foreign key originally listed. The same command is also used to test if two tables can be linked together in a schema
6. "assign" in order to establish primary keys in tables where primary keys are not originally listed

Below is the Python code to clean each of the four dataframes:

Loans dataframe:

# Delete duplicates
kiva_loans_df.drop_duplicates(keep="first",inplace=True)

# Rename "id" column to "loan_id" for clarity 
kiva_loans_df.rename(index=str, columns = {"id":"loan_id","sector":"loan_sector"},inplace=True)

# Convert dates from string to datetime format
kiva_loans_df["posted_time"]=pd.to_datetime(kiva_loans_df["posted_time"])
kiva_loans_df["disbursed_time"]=pd.to_datetime(kiva_loans_df["disbursed_time"])
kiva_loans_df["funded_time"]=pd.to_datetime(kiva_loans_df["funded_time"])
kiva_loans_df["date"]=pd.to_datetime(kiva_loans_df["date"])

# Create separate columns for the number of female, male, and total borrowers
kiva_loans_df["num_fem_borrowers"] = kiva_loans_df["borrower_genders"].str.count("female")
kiva_loans_df["num_m_borrowers"] = (kiva_loans_df["borrower_genders"].str.count("male")) - (kiva_loans_df["num_fem_borrowers"])
kiva_loans_df["total_borrowers"] = kiva_loans_df["num_fem_borrowers"] + kiva_loans_df["num_m_borrowers"]

# Merge the loans and regions DataFrames to make sure we have a way to link them together
loans_regions_df = pd.merge(kiva_loans_df,regions_df,how="left",left_on=["country","region"],right_on=["country","region"])

# Create a "cleaned" DataFrame for Kiva loans with only the columns we need
cleaned_kiva_loans_df = loans_regions_df[["loan_id","funded_amount","loan_amount","activity","loan_sector","use","currency","partner_id","posted_time","disbursed_time","funded_time","term_in_months","lender_count", "repayment_interval","date","num_fem_borrowers","num_m_borrowers","total_borrowers","location_id"]].copy()
cleaned_kiva_loans_df.drop_duplicates(keep="first",inplace=True)

# Re-merge so to re-define the theme_id and ensure it will link
loans_themes_df = pd.merge(cleaned_kiva_loans_df,cleaned_theme_df,how="left",left_on="theme_id_old",right_on="theme_id_old")

# Redefine the cleaned Kiva loans DataFrame to include a theme ID to link two dataframes together 
cleaned_kiva_loans_df = loans_themes_df[["loan_id","funded_amount","loan_amount","activity","loan_sector","purpose","currency","partner_id","posted_time","disbursed_time","funded_time","term_in_months","lender_count","repayment_interval","date","num_fem_borrowers","num_m_borrowers","total_borrowers","location_id","theme_id"]].copy()

Regions dataframe:

# Delete duplicates
regions_df.drop_duplicates(keep="first",inplace=True)

# Create a location ID
regions_df.sort_values(["country","region"], ascending = True, inplace = True)
regions_df=regions_df.assign(location_id=(regions_df["country"]+"_"+regions_df["region"]).astype("category").cat.codes)

# Rename columns with capital letters, spaces, or are otherwise unclear
regions_df.rename(index=str, columns = {"ISO":"iso","LocationName":"location_name","MPI":"mpi","sector":"loan_sector"},inplace=True)

# Create "cleaned_regions_df" with only the relevant columns from the "regions_df" and drop duplicates
cleaned_regions_df = regions_df[["location_id","iso","country","region","world_region","mpi","lat","lon"]].copy()
cleaned_regions_df.drop_duplicates(["location_id"],keep="first",inplace=True)

Loan theme dataframe:

# Rename columns not to have spaces in their titles
theme_df.rename(index=str, columns = {"id":"loan_id","Loan Theme ID":"theme_id_old","Loan Theme Type":"theme_type","Partner     ID":"partner_id"},inplace=True)

# Define "cleaned_theme_df" as only the relevant columns and remove duplicates
cleaned_theme_df = theme_df[["theme_id_old","theme_type"]].copy()
cleaned_theme_df.sort_values("theme_id_old",inplace=True,ascending=True)
cleaned_theme_df.drop_duplicates(["theme_id_old"],keep="first",inplace=True)

# The old theme_id relies on case sensitivity. Recreating the theme_id
cleaned_theme_df = cleaned_theme_df.assign(theme_id=(cleaned_theme_df["theme_id_old"]).astype("category").cat.codes)

Field partners dataframe:

# Rename the columns that have spaces or capital letters
field_partner_df.rename(index=str, columns = {"Partner ID":"partner_id","Field Partner Name":"field_partner_name","Loan Theme     ID":"theme_id","Loan Theme Type":"theme_type","ISO":"iso","LocationName":"location_name","sector":"partner_sector"},inplace=True)
field_partner_df = field_partner_df[["partner_id","field_partner_name","partner_sector"]].copy()
field_partner_df.drop_duplicates(keep="first",inplace=True)

# Merge loans and partner dataframes to make sure there aren't partners not attached to loans
loan_partner_df = pd.merge(cleaned_kiva_loans_df,field_partner_df,how="left",left_on="partner_id",right_on="partner_id")

# Create cleaned dataframe with only the relevant columns and drop all duplicates
cleaned_field_partner_df = loan_partner_df[["partner_id","field_partner_name","partner_sector"]].copy()
cleaned_field_partner_df.drop_duplicates(subset=["partner_id"],keep="first",inplace=True)
cleaned_field_partner_df.dropna(subset=["partner_id"],how="any",inplace=True)

Load

SQL: Create the Schema

Data is organized into a schema of four tables (kiva_loans, loan_regions, loan_themes, and field_partners)

Complete the following in MySQL to create the schema:

Use a SQL script to establish the database:

CREATE DATABASE IF NOT EXISTS kiva_db;
USE kiva_db;
ALTER DATABASE kiva_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

DROP TABLE IF EXISTS kiva_loans;
DROP TABLE IF EXISTS loan_regions;
DROP TABLE IF EXISTS loan_themes;
DROP TABLE IF EXISTS field_partners;

In the Kiva database, create the four separate tables:

Field partners

CREATE TABLE field_partners(
    partner_id INT,
    field_partner_name VARCHAR(200),
    partner_sector VARCHAR(50),
    PRIMARY KEY(partner_id)
);

Loan themes

CREATE TABLE loan_themes(
    theme_id VARCHAR(100),
    theme_type VARCHAR(100),
    PRIMARY KEY(theme_id)
);

Loan regions

CREATE TABLE loan_regions(
    location_id INT,
    iso VARCHAR(5),
    country VARCHAR(50),
    region VARCHAR(100),
    world_region VARCHAR(50),
    mpi FLOAT,
    lat FLOAT,
    lon FLOAT,
    PRIMARY KEY(location_id)
);

Kiva loans

CREATE TABLE kiva_loans(
    loan_id INT,
    funded_amount FLOAT,
    loan_amount FLOAT,
    activity VARCHAR(100),
    loan_sector VARCHAR(100),
    purpose MEDIUMTEXT,
    currency VARCHAR(5),
    partner_id INT,
    posted_time DATETIME,
    disbursed_time DATETIME,
    funded_time DATETIME,
    term_in_months FLOAT,
    lender_count INT,
    repayment_interval VARCHAR(50),
    date DATE,
    num_fem_borrowers INT,
    num_m_borrowers INT,
    total_borrowers INT,
    location_id INT,
    theme_id VARCHAR(100),
    PRIMARY KEY(loan_id),
    FOREIGN KEY(partner_id) REFERENCES field_partners(partner_id) ON DELETE CASCADE,
    FOREIGN KEY(theme_id) REFERENCES loan_themes(theme_id) ON DELETE CASCADE,
    FOREIGN KEY(location_id) REFERENCES loan_regions(location_id) ON DELETE CASCADE
);

Python: Connect DataFrames to MySQL Database

A MySQL database connection is created and an engine is created in Python (if reproducing, make sure to add in the appropriate computer password):

# Establish the MySQL database connection and create the engine
rds_connection_string = "root:<PASSWORD HERE>@127.0.0.1/kiva_db"
engine = create_engine(f"mysql+pymysql://{rds_connection_string}?charset=utf8", encoding = "utf8")

The engine is used to populate all the SQL tables

# Add the field partner DataFrame to MySQL    
cleaned_field_partner_df.to_sql(name="field_partners",con=engine,if_exists="append",index=False,chunksize=2000)

# Add the theme DataFrame to MySQL
cleaned_theme_df.to_sql(name="loan_themes",con=engine,if_exists="append",index=False)

# Add the regions DataFrame to MySQL
cleaned_regions_df.to_sql(name="loan_regions",con=engine,if_exists="append",index=False,chunksize=2000)

# Add the Kiva loans DataFrame to MySQL
cleaned_kiva_loans_df.to_sql(name="kiva_loans",con=engine,if_exists="append",index=False,chunksize=2000)

Conclusion/Ideas for Future Research

The Kiva database is really comprehensive and could be used to conduct many analyses. The following ideas come to mind:
- Performing an exploratory analysis on the impact of gender on Kiva loans:
  - Do loan recipients tend to be male or female?
  - Do men or women tend to ask for more or less money?
  - Do men and women ask for loans to accomplish similar goals or different goals?
  - Do men and women ask for similar or different repayment intervals?
- Performing an exploratory analysis on regions and Kiva loans:
  - Do wealthier (from the mpi index score) or poorer regions have more Kiva recipents? Do these Kiva recipients ask for more or less money?
  - Do different regions tend to have similar or different financial goals with their Kiva loans?
  - Do Kiva recipients and partners tend to be from the same region or from different regions?
- Performing an exploratory analysis on individual borrowers and group borrowers:
  - Do individual borrowers and group borrowers tend to have the same or different repayment intervals?
  - Do individual borrowers and group borrowers tend to have higher or lower loan amounts?
  - Do individual borrowers and group borrowers tend to have loans for similar or different purposes?
  - Do individual borrowers and group borrowers tend to be from similar or different regions?

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
.ipynb_checkpoints		.ipynb_checkpoints
Resources		Resources
.gitattributes		.gitattributes
README.md		README.md
etl_analysis.ipynb		etl_analysis.ipynb
scripts.sql		scripts.sql

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.ipynb_checkpoints

.ipynb_checkpoints

Resources

Resources

.gitattributes

.gitattributes

README.md

README.md

etl_analysis.ipynb

etl_analysis.ipynb

scripts.sql

scripts.sql

Repository files navigation

Kiva ETL

Background Information

Data Source

Data Files:

Extract

Transform

Cleaning

Load

SQL: Create the Schema

Python: Connect DataFrames to MySQL Database

Conclusion/Ideas for Future Research

About

Releases

Packages

Languages

mileslucey/kiva_etl

Folders and files

Latest commit

History

Repository files navigation

Kiva ETL

Background Information

Data Source

Data Files:

Extract

Transform

Cleaning

Load

SQL: Create the Schema

Python: Connect DataFrames to MySQL Database

Conclusion/Ideas for Future Research

About

Topics

Resources

Stars

Watchers

Forks

Languages