## Import package and data

Note: `read_excel` is used here because the '.csv' version of the file did not load on my machine. If the '.csv' file works on your computer, switch to `pd.read_csv`, which is faster.

In [1]:
import pandas as pd
df = pd.read_excel("LoanStats3a.xlsx")
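If the '.csv' file does open on your machine, a loader along these lines could pick the reader from the file extension. This is only a sketch: the helper name `load_loans` and its `skiprows` parameter are assumptions made for illustration, not part of the original notebook. Some CSV exports carry a notice line above the header row, in which case passing `skiprows=1` may be needed.

```python
import os
import pandas as pd

def load_loans(path, skiprows=0):
    """Hypothetical helper: read a loan file with the reader matching its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        # low_memory=False reads the file in one pass, avoiding mixed-dtype guesses
        return pd.read_csv(path, skiprows=skiprows, low_memory=False)
    # Anything else is treated as an Excel workbook, as in the cell above
    return pd.read_excel(path, skiprows=skiprows)
```

With this in place, `df = load_loans("LoanStats3a.xlsx")` behaves like the cell above, and only the file name has to change to use the CSV.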

## Data manipulation

### Select useful columns

In [2]:
df=df[['loan_amnt','term','sub_grade','emp_length','home_ownership','verification_status','purpose','loan_status']]

### Drop all records with null values

In [3]:
df=df.dropna()

### Keep only records whose loan status is 'Fully Paid' or 'Charged Off'

In [4]:
df=df[df.loan_status.isin(['Fully Paid','Charged Off'])]

### Drop all records where the employment length is not available

In [5]:
df=df[df.emp_length!='n/a'].copy()  # copy so the column assignments below do not act on a slice
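To see what the three filtering steps do, here is a minimal sketch on a made-up five-row frame. The toy values are invented for illustration and are not taken from the loan file.

```python
import pandas as pd

# Invented rows: one null status, one 'Current' loan, one 'n/a' employment length
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid", None],
    "emp_length": ["3 years", "n/a", "10+ years", "< 1 year", "2 years"],
})

step1 = toy.dropna()                                                 # drops the null row
step2 = step1[step1.loan_status.isin(["Fully Paid", "Charged Off"])]  # drops 'Current'
step3 = step2[step2.emp_length != "n/a"]                             # drops 'n/a'

counts = (len(toy), len(step1), len(step2), len(step3))
print(counts)  # (5, 4, 3, 2)
```

Printing the row count after each step on the real data is a quick way to check how many records each filter removes.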

### Convert 'term' to an integer number of months, keeping the column name 'term'

In [6]:
df['term']=df.term.apply(lambda x: int(x.split()[0]))
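As a small illustration of the conversion, assuming the raw values look like `' 36 months'` (the sample strings below are assumptions, not read from the file):

```python
import pandas as pd

terms = pd.Series([" 36 months", " 60 months"])

# Same approach as the cell above: take the first whitespace-separated token
as_int = terms.apply(lambda x: int(x.split()[0]))

# A vectorized alternative that extracts the first run of digits
as_int_vectorized = terms.str.extract(r"(\d+)")[0].astype(int)

print(as_int.tolist())  # [36, 60]
```

Both versions give the same result on these samples; `split()` with no argument also discards the leading space.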

### Convert 'sub_grade' into a number, and call it 'gradeencoding'

In [7]:
grades=['G','F','E','D','C','B','A']
df['gradeencoding']=df['sub_grade'].apply(lambda x: grades.index(x[0])+(0.7-0.1*float(x[1])))
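The encoding can be checked by hand on a few sub-grades. The sketch below wraps the same formula in a function; the name `encode_sub_grade` is introduced here for illustration only.

```python
grades = ['G', 'F', 'E', 'D', 'C', 'B', 'A']

def encode_sub_grade(sub_grade):
    """Same formula as the cell above: letter index plus a within-letter offset."""
    letter, digit = sub_grade[0], sub_grade[1]
    return grades.index(letter) + (0.7 - 0.1 * float(digit))

for sg in ["A1", "A5", "B1", "G5"]:
    # Rounded to one decimal to hide floating-point noise
    print(sg, round(encode_sub_grade(sg), 1))
# A1 6.6
# A5 6.2
# B1 5.6
# G5 0.2
```

Higher values mean better grades. Steps within a letter are 0.1 apart, while the step from the worst sub-grade of one letter to the best of the next (for example A5 to B1) is 0.6, so the scale is ordered but not evenly spaced.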

### Convert 'emp_length' into a number, and call it 'emplen'

In [8]:
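def empllengthprocess(x):
    # Keep the text before 'year', e.g. '10+ ' from '10+ years'
    x = x.split('year')[0]
    if '+' in x:
        # '10+ years' is encoded as 12 in this notebook
        return 12
    if '<' in x:
        # '< 1 year' is encoded as 0
        return 0
    return int(x)
df['emplen'] = df.emp_length.apply(empllengthprocess)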
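A quick check of the mapping on sample strings, with the function repeated so the snippet runs on its own. The sample values are assumptions about what the column contains.

```python
def empllengthprocess(x):
    # Same logic as the cell above
    x = x.split('year')[0]
    if '+' in x:
        return 12
    if '<' in x:
        return 0
    return int(x)

samples = ["< 1 year", "1 year", "3 years", "10+ years"]
converted = [empllengthprocess(s) for s in samples]
print(converted)  # [0, 1, 3, 12]
```

Note that '10+ years' maps to 12 rather than 10, so the top category sits two units above '9 years' instead of one.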

### Final dataset

In [10]:
df=df[['loan_amnt','term','verification_status','gradeencoding','emplen','purpose','home_ownership', 'loan_status']]

### Save the new dataset to 'Loans_processed.csv'

In [11]:
df.to_csv('Loans_processed.csv',index=False)