# Project 1

Our challenges:

1. Get target files from a directory
2. Read those files into our program and concatenate them into one
3. Apply regex rules to extract certain information from text data
4. Create a new file with extracted data as additional columns


## Step by step

1. Get target files from a directory (e.g., any file that has prefix "20")
   - use the "os" module to access files on our computer
   - wrap the logic in a function

In [18]:
import os  # standard-library module for working with the filesystem

files = os.listdir("../resources")

target_files = []
for file in files:
    if file.startswith("20"):
        target_files.append(file)
        

In [19]:
# the same filter, rewritten as a list comprehension
target_files = [file for file in files if file.startswith("20")]

In [20]:
# step 1 
def get_target_files(prefix):
    return [file for file in os.listdir("../resources/") if file.startswith(prefix)]
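
The function above always looks in "../resources/". As a minimal sketch, a variant could take the directory as an argument and sort the result, since `os.listdir` returns names in arbitrary order. The name `get_target_files_in`, the `directory` parameter, and the file names in the demo are assumptions made for illustration, not part of the original notebook.

```python
import os
import tempfile


def get_target_files_in(directory, prefix):
    """Return the sorted names in `directory` that start with `prefix`."""
    # os.listdir gives no ordering guarantee, so sort for reproducible results
    return sorted(name for name in os.listdir(directory) if name.startswith(prefix))


# Demo on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["2018.csv", "2014.csv", "notes.txt"]:
        open(os.path.join(tmp_dir, name), "w").close()
    found = get_target_files_in(tmp_dir, "20")

print(found)  # ['2014.csv', '2018.csv']
```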

2. Read in those files and concatenate them into one
   - we will use pandas, which makes reading and combining CSV files straightforward
   - wrap the logic in a function

In [29]:
import pandas as pd  # let's use pandas


In [30]:
df_list = []

for file in target_files:
    df = pd.read_csv("../resources/" + file)
    df_list.append(df)

In [31]:
# use pd.concat() to concatenate a list of dataframes
concat_df = pd.concat(df_list)

In [32]:
# this time use list comprehension to achieve the same
df_list = [pd.read_csv("../resources/" + file) for file in target_files]
concat_df = pd.concat(df_list)

In [33]:
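# step 2
# note: defining this function rebinds the name concat_df, which held a dataframe in the cells above
def concat_df(file_list):
    df_list = [pd.read_csv("../resources/" + file) for file in file_list]
    return pd.concat(df_list)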

In [34]:
concat_df(target_files)

Unnamed: 0,loan_id,customer_id,amount,duration,payments,status,year,month,day,date,fulldate,location,purpose,comments
0,5221,A00005700,52512,12,4376,C,2018,12,5,2018-12-05,2018-12-05,64,debt_consolidation,interest rate: 3.78%; monthly payment: 204
1,5346,A00010068,55632,24,2318,C,2018,12,6,2018-12-06,2018-12-06,64,home_improvement,Interest rate: 3.85%; Monthly payment: 109
2,6402,A00002334,139488,24,5812,C,2018,12,6,2018-12-06,2018-12-06,1,home,Interest rate:3.52%; Monthly payment:278
3,5027,A00003678,160920,36,4470,C,2018,12,2,2018-12-02,2018-12-02,1,home,interest rate:3.05%; monthly payment:281
4,6856,A00008321,163332,36,4537,C,2018,12,1,2018-12-01,2018-12-01,1,home,interest rate:3.38%; Monthly payment:662
5,5428,A00002622,230220,36,6395,C,2018,12,2,2018-12-02,2018-12-02,59,home,interest rate:3.09%; monthly payment:750
6,6748,A00006265,240900,60,4015,C,2018,12,8,2018-12-08,2018-12-08,1,home,Interest rate:3.59%; Monthly payment:$250
7,4989,A00001772,352704,48,7348,C,2018,12,5,2018-12-05,2018-12-05,1,home,interest rate:3.99%; monthly payment:$219
0,5576,A00002350,31248,12,2604,A,2014,12,21,2014-12-21,2014-12-21,18,debt_consolidation,Interest rate:3.88%; monthly payment:810
1,6306,A00007819,36684,12,3057,A,2014,12,20,2014-12-20,2014-12-20,1,debt_consolidation,interest rate:3.92%; Monthly payment:$395
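
In the output above, the row labels restart at 0 where the second file begins, and the index that was saved in each CSV shows up as an "Unnamed: 0" column. A small sketch of one way to tidy both, using in-memory stand-ins for the files (the rows below are a cut-down illustration, not the real data):

```python
import io

import pandas as pd

# Two tiny in-memory stand-ins for the yearly CSV files (illustrative rows only)
csv_2018 = io.StringIO(",loan_id,amount\n0,5221,52512\n1,5346,55632\n")
csv_2014 = io.StringIO(",loan_id,amount\n0,5576,31248\n")

# index_col=0 reads the saved index column as the index instead of "Unnamed: 0"
frames = [pd.read_csv(buf, index_col=0) for buf in (csv_2018, csv_2014)]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's labels
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Whether to renumber is a design choice: keeping the per-file labels can be useful if you later need to trace a row back to its position in the source file.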


3. Apply regex rules to extract certain information from text data
   - inspect a sample string to see how the monthly payment is written
   - write a regex pattern that captures the amount
   - handle the case where the pattern does not match
   - wrap the logic in a function

In [36]:
example_str = "interest rate: 3.35%; monthly payment: $356"

In [37]:
import re


In [38]:
payment_regex = r"monthly payment:\s*\$?(\d+)"  # raw string so \s, \$ and \d reach the regex engine unchanged
match = re.search(payment_regex, example_str)
match.group(1)

'356'
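
The comments in the combined data mix "monthly payment" and "Monthly payment", so a case-sensitive search will miss some rows. A quick sketch of the difference, using two strings copied from the output above:

```python
import re

payment_regex = r"monthly payment:\s*\$?(\d+)"

samples = [
    "interest rate: 3.78%; monthly payment: 204",
    "Interest rate:3.59%; Monthly payment:$250",
]

# A case-sensitive search misses the capitalised "Monthly"
case_sensitive = [re.search(payment_regex, s) is not None for s in samples]

# re.I (short for re.IGNORECASE) matches both spellings
amounts = [re.search(payment_regex, s, re.I).group(1) for s in samples]

print(case_sensitive)  # [True, False]
print(amounts)         # ['204', '250']
```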

In [40]:
payment_regex = r"monthly payment:\s*\$?(\d+)"
match = re.search(payment_regex, "some bad string")  # when there is no match, re.search returns None
match.group(1)  # None has no .group(), so this raises AttributeError

AttributeError: 'NoneType' object has no attribute 'group'
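
One way to avoid this error is to test the result before calling `.group()`. This is only a sketch of the idea; catching the exception with try/except works too.

```python
import re

payment_regex = r"monthly payment:\s*\$?(\d+)"

match = re.search(payment_regex, "some bad string")

# re.search returns None when nothing matches, so check before calling .group()
payment = match.group(1) if match is not None else None
print(payment)  # None
```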

In [41]:
# step 3
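def get_payment(string):
    """Return the monthly payment found in `string` as a float, or None if there is no match."""
    payment_regex = r"monthly payment:\s*\$?(\d+)"  # raw string avoids invalid-escape warnings
    match = re.search(payment_regex, string, re.I)  # re.I: ignore upper/lower case
    try:
        payment = float(match.group(1))
        return payment
    except AttributeError:  # match is None when the pattern was not found
        print(f"No match for string: {string}")
        return None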
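
The same approach could be reused for the other value in the comments. The sketch below is a hypothetical `get_interest_rate` helper (not part of the original notebook) that assumes the rate is always written as a number followed by "%", as it is in the rows shown earlier.

```python
import re


def get_interest_rate(string):
    """Return the interest rate in percent as a float, or None if absent."""
    # Optional whitespace after the colon; digits with an optional decimal part
    rate_regex = r"interest rate:\s*(\d+(?:\.\d+)?)\s*%"
    match = re.search(rate_regex, string, re.I)
    if match is None:
        return None
    return float(match.group(1))


print(get_interest_rate("Interest rate:3.52%; Monthly payment:278"))  # 3.52
print(get_interest_rate("na"))  # None
```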

4. Put everything together in a main function
   - call the component functions defined above
   - use pandas "apply" to run get_payment on every comment

In [42]:
# step 4
def main():
    files = get_target_files("20")
    combined_df = concat_df(files)
    combined_df['monthly_payment'] = combined_df['comments'].apply(get_payment)
    combined_df.to_csv("result.csv", index=False)
    print("Done!")

In [43]:
main()

No match for string: na
No match for string: na
No match for string: na
No match for string: na
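
The four messages come from rows whose comment is just "na". As an alternative to `apply`, pandas can run the regex over the whole column at once with `Series.str.extract`. This is a sketch on a small hand-made Series, not the real data; rows without a match come back as NaN instead of printing a message.

```python
import re

import pandas as pd

# A few comment strings in the same shape as the data above, plus an "na" entry
comments = pd.Series([
    "interest rate: 3.78%; monthly payment: 204",
    "Interest rate:3.59%; Monthly payment:$250",
    "na",
])

# With one capture group and expand=False, str.extract returns a Series;
# rows that do not match become NaN
monthly_payment = comments.str.extract(
    r"monthly payment:\s*\$?(\d+)", flags=re.I, expand=False
).astype(float)
print(monthly_payment)
```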
