### BUSA3020 - Business Analytics - Assignment 1 {-}


---

### About the Assignment

Credit score cards are used as a risk control method in the financial industry. Personal information submitted by credit card applicants is used to predict the probability of future defaults. The bank uses this data to decide whether or not to issue a credit card to the applicant.




| Feature Name | Explanation | Additional Remarks |
|--------------|-------------|--------------------|
| ID | Randomly allocated client number | |
| Income | Annual income | |
| Gender | Applicant's gender | Male = 0, Female = 1 |
| Car | Car ownership | Yes = 1, No = 0 |
| Children | Number of children | |
| Real Estate | Real estate ownership | Yes = 1, No = 0 |
| Days Since Birth | Number of days | Counted backwards from the current day (0); -1 means yesterday |
| Days Employed | Number of days | Counted backwards from the current day (0). A positive value means the person is currently unemployed. |
| Payment Default | Whether a client has overdue credit card payments | Yes = 1, No = 0 |



---


### Problem 1 - (50 points) {-}


**Question 1** 

- Import the `assignment_data.xlsx` file from the `data` folder into a pandas DataFrame named `df`;
- Delete duplicate rows from `df` according to `ID`;
- Delete the `ID` column.
- How many rows are left in `df`?

(10 points)

In [40]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.metrics import accuracy_score

In [41]:
# The question places the file in the `data` folder; adjust the path if it sits elsewhere.
df = pd.read_excel("data/assignment_data.xlsx")

In [42]:
df = df.drop_duplicates(subset='ID')

In [43]:
df = df.drop(columns='ID')

In [44]:
print(f"Number of rows left in df: {len(df)}")

Number of rows left in df: 5976
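
As a small illustration of the deduplication step, the sketch below applies the same two calls to a made-up frame. The values are invented for illustration and do not come from `assignment_data.xlsx`. By default, `drop_duplicates(subset='ID')` keeps the first row it sees for each `ID`.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "ID": [101, 102, 102, 103],
    "Income": [50000, 62000, 62000, 48000],
})

# Keep the first row for each ID (keep='first' is the default),
# then drop the ID column, which carries no predictive information.
toy = toy.drop_duplicates(subset="ID").drop(columns="ID")

print(f"Rows left in toy frame: {len(toy)}")
```

Here the repeated `ID` 102 collapses to a single row, so three rows remain.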


---

**Question 2**

- Reset the index in `df` using an appropriate function from `pandas` so that the new index corresponds to the number of rows (make sure to delete the old index). 
- How many positive values of `Days Employed` are there?
- Replace the positive values of `Days Employed` with 0 (zero) in `df`

(10 points)

In [45]:
df.reset_index(drop=True, inplace=True)

In [46]:
positive_days_employed_count = (df['Days Employed'] > 0).sum()
positive_days_employed_count

967

In [47]:
df.loc[df['Days Employed'] > 0, 'Days Employed'] = 0
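
The count-then-replace logic can be checked on a toy column. The sketch below uses invented values and, as an alternative to the `.loc` assignment, caps the column at zero with `Series.clip(upper=0)`; both approaches should give the same result for this purpose.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({"Days Employed": [-1200, 500, -30, 1000, 0]})

# Count strictly positive values before modifying the column.
n_positive = int((toy["Days Employed"] > 0).sum())

# Cap every value at 0: positives become 0, non-positives are unchanged.
toy["Days Employed"] = toy["Days Employed"].clip(upper=0)

print(n_positive)
print(toy["Days Employed"].tolist())
```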

---
**Question 3**

Create two new variables in `df` named 

1. `Age`;
2. `Years in Employment`,

which measure age and employment length in **years** (decimal numbers) from `Days Since Birth` and `Days Employed` by applying appropriate transformations to these variables. 

Delete the original variables `Days Since Birth` and `Days Employed`.

(5 points)


In [13]:
# Day counts are negative (counted backwards from today), so take the absolute value.
df['Age'] = abs(df['Days Since Birth'] / 365)
df['Years in Employment'] = abs(df['Days Employed'] / 365)

In [14]:
df = df.drop(columns=['Days Since Birth', 'Days Employed'])
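
A quick worked example of the day-to-year conversion, on invented values: dividing by 365 ignores leap years, which is the convention used in the cell above (365.25 would be a common alternative, not something the assignment specifies).

```python
import pandas as pd

DAYS_PER_YEAR = 365  # ignores leap years, matching the solution cell

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Days Since Birth": [-7300, -14600],
    "Days Employed": [-730, 0],
})

# Negative day counts become positive durations in years.
toy["Age"] = toy["Days Since Birth"].abs() / DAYS_PER_YEAR
toy["Years in Employment"] = toy["Days Employed"].abs() / DAYS_PER_YEAR

# Remove the original day-count columns.
toy = toy.drop(columns=["Days Since Birth", "Days Employed"])

print(toy)
```

For instance, -7300 days corresponds to 7300 / 365 = 20 years.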

---
**Question 4**

- Create a **one**-dimensional NumPy array named `y` by exporting the first 5,000 observations of `Payment Default`. (Hint: see the `ravel()` function)
- Create a NumPy array named `X` by exporting the first 5,000 observations of the following columns `Gender`, `Car`, `Real Estate`, `Children`, `Income`, `Age`, `Years in Employment`.
 
(10 points)


In [15]:
y = df['Payment Default'].head(5000).to_numpy().ravel()

In [16]:
Columns = ['Gender', 'Car', 'Real Estate', 'Children', 'Income', 'Age', 'Years in Employment']
X = df[Columns].head(5000).to_numpy()
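
The hint about `ravel()` matters when the target is selected as a one-column DataFrame, which exports as a two-dimensional array. The sketch below, on invented labels, compares the shapes; selecting the column as a Series already gives a one-dimensional array, in which case `ravel()` changes nothing.

```python
import pandas as pd

# Toy labels, invented for illustration only.
toy = pd.DataFrame({"Payment Default": [0, 1, 0, 1]})

# Double brackets select a one-column DataFrame: shape (4, 1).
two_d = toy[["Payment Default"]].to_numpy()

# ravel() flattens it to shape (4,).
one_d = two_d.ravel()

# Single brackets select a Series, which is already 1-D after export.
from_series = toy["Payment Default"].to_numpy()

print(two_d.shape, one_d.shape, from_series.shape)
```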

---

**Question 5** 

- Use an appropriate `scikit-learn` library we learned in class to create the following NumPy arrays: `y_train`, `y_test`, `X_train` and `X_test` by splitting the data into 70% training and 30% test datasets. 
- Set `random_state` to 0 and stratify subsamples so that train and test datasets have roughly equal proportions of the target's class labels. 

(5 points) 

In [21]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

In [22]:
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (3500, 7)
Shape of X_test: (1500, 7)
Shape of y_train: (3500,)
Shape of y_test: (1500,)
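
The shapes confirm the 70/30 split but not the stratification. One way to check the latter, sketched here on synthetic labels rather than the assignment data, is to compare the share of positive labels in each subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, invented for illustration: 30 positives, 70 negatives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# With stratify=y_toy, both subsets should have close to 30% positives.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Positive share in train: {train_rate:.3f}, in test: {test_rate:.3f}")
```

The same comparison could be run on `y_train` and `y_test` from the cell above.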


---

**Question 6**

- Create new variables by using an appropriate `scikit-learn` library we learned in class to standardize the features from the training and test datasets to mean zero and variance one. Name the new variables by appending '_scaled' to the original variable names.


(10 points)   

In [25]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=[col + '_scaled' for col in Columns])
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=[col + '_scaled' for col in Columns])

print(f"Shape of X_train_scaled_df: {X_train_scaled_df.shape}")
print(f"Shape of X_test_scaled_df: {X_test_scaled_df.shape}")

print(X_train_scaled_df.head())
print(X_test_scaled_df.head())

Shape of X_train_scaled_df: (3500, 7)
Shape of X_test_scaled_df: (1500, 7)
   Gender_scaled  Car_scaled  Real Estate_scaled  Children_scaled  \
0      -1.368782   -0.776250           -1.357651        -0.558100   
1      -1.368782    1.288245            0.736566         0.681335   
2      -1.368782    1.288245            0.736566        -0.558100   
3       0.730577   -0.776250            0.736566         1.920770   
4       0.730577   -0.776250           -1.357651        -0.558100   

   Income_scaled  Age_scaled  Years in Employment_scaled  
0       3.013142    0.971043                   -0.644484  
1       0.284764   -0.747744                    0.015658  
2      -0.494772    0.386946                   -0.677407  
3      -0.884540   -0.757011                    0.176051  
4      -0.105004   -0.597323                    0.690152  
   Gender_scaled  Car_scaled  Real Estate_scaled  Children_scaled  \
0       0.730577    1.288245            0.736566        -0.558100   
1      -1.368782  
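
To confirm that standardization behaved as intended, one can check the column means and standard deviations of the scaled training data. The sketch below does this on synthetic features (not the assignment data). Because the scaler is fitted on the training set only, the scaled test set will generally be close to, but not exactly, mean zero and variance one.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features, invented for illustration only.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=3.0, size=(200, 3))
X_te = rng.normal(loc=5.0, scale=3.0, size=(50, 3))

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)  # learn mean/std from training data, then scale
X_te_s = scaler.transform(X_te)      # reuse the training mean/std on the test data

train_means = X_tr_s.mean(axis=0)
train_stds = X_tr_s.std(axis=0)
print("Train means:", np.round(train_means, 6))
print("Train stds: ", np.round(train_stds, 6))
print("Test means: ", np.round(X_te_s.mean(axis=0), 3))
```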

---

### Problem 2 - (20 points) {-}

**Question 7**

Fit the following two classifiers to the transformed training dataset using `scikit-learn` libraries.

- Perceptron - name your instance `pc` and set `random_state=1`
- Logistic Regression - name your instance `lr` and set `random_state=1`

When initializing instances of the above classifiers only set the parameters referenced above and nothing else.

(20 points)

In [29]:
pc = Perceptron(random_state=1)
pc.fit(X_train_scaled, y_train)

lr = LogisticRegression(random_state=1)
lr.fit(X_train_scaled, y_train)

LogisticRegression(random_state=1)

---
### Problem 3 - (30 points) {-}


**Question 8**

- Using a method built into each of the two classifiers compute their prediction accuracies on the training data;
- Store the accuracy values into variables named according to the following pattern: `classifier_name_accuracy_train`, e.g. you should have `lr_accuracy_train`; 
- Print the two accuracy **variables** along with their brief descriptions.

(10 points)

In [35]:
pc_accuracy_train = pc.score(X_train_scaled, y_train)

lr_accuracy_train = lr.score(X_train_scaled, y_train)

print(f"Perceptron training accuracy: {pc_accuracy_train}")
print(f"Logistic Regression training accuracy: {lr_accuracy_train}")

Perceptron training accuracy: 0.496
Logistic Regression training accuracy: 0.5577142857142857
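
For scikit-learn classifiers, the built-in `score` method returns mean accuracy, so it should agree with `accuracy_score` applied to the model's predictions. The sketch below checks this on synthetic data (not the assignment data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic, linearly separable data, invented for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

clf = LogisticRegression(random_state=1).fit(X_toy, y_toy)

# Two routes to the same number: the built-in method and the metrics function.
via_score = clf.score(X_toy, y_toy)
via_metric = accuracy_score(y_toy, clf.predict(X_toy))

print(via_score, via_metric)
```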


---

**Question 9** 

- Using a method built into each of the above classifiers compute their prediction accuracy for the test dataset
- Store the accuracy values into variables named according to the following pattern: `classifier_name_accuracy_test`, e.g. you should have `lr_accuracy_test`. 
- Print the two accuracy **variables** along with brief descriptions.

(10 points)

In [37]:
pc_accuracy_test = pc.score(X_test_scaled, y_test)
lr_accuracy_test = lr.score(X_test_scaled, y_test)

print(f"Perceptron test accuracy: {pc_accuracy_test}")
print(f"Logistic Regression test accuracy: {lr_accuracy_test}")

Perceptron test accuracy: 0.49466666666666664
Logistic Regression test accuracy: 0.5073333333333333


---

**Question 10** 

Using nicely formatted text in Markdown, comment on the accuracies computed in Questions 8 & 9, making sure you address:
- training and test datasets; 
- Perceptron and Logistic Regression models. 

Are the results as expected, and why or why not? (Hint: You are not expected to comment on why a particular model is better.) 

(10 points)


| Models | Training | Test |
| :---------------- | :------: | ----: |
| Perceptron | 0.496 | 0.495 |
| Logistic Regression | 0.558 | 0.507 |


### Analysis Output:
The accuracies of the Perceptron and Logistic Regression models on the training and test datasets provide insight into how well each model fits the data and how well it generalises.

#### Perceptron Model
- **Training Accuracy: 49.6%** - the Perceptron is marginally below the 50% level that random guessing would achieve if the two classes were balanced, so it has not learned a useful decision boundary even on the data it was fitted to.
- **Test Accuracy: 49.47%** - almost identical to the training accuracy, which points to underfitting rather than overfitting: the model is equally uninformative on seen and unseen data.




#### Logistic Regression Model
- **Training Accuracy: 55.77%** - the Logistic Regression model fits the training data better than the Perceptron, by roughly 6 percentage points. 
- **Test Accuracy: 50.73%** - a drop of about 5 percentage points from training, so most of the training-set advantage does not carry over to new data, and the model is only marginally better than random guessing. 


#### Insights
Both models score close to 50% on the test data (49.47% and 50.73%), which indicates that neither makes useful predictions. 

The Logistic Regression model does better than the Perceptron on the training data, but that advantage largely disappears on the test data. This suggests that the available features may not separate defaulters from non-defaulters with a linear decision boundary, and that other models or additional features may be needed to improve predictions. 

In summary, neither model performs well, on the training data or on new data, which points to a need for further improvement in the modelling approach. 
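
The 50% random-guessing reference used above is only the right yardstick when the two classes are balanced. A more general baseline is the accuracy of always predicting the most frequent class. The sketch below shows how this could be computed with scikit-learn's `DummyClassifier` on invented labels; on the assignment data one would fit it on `X_train_scaled`, `y_train` and score it on the test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels for illustration: 60 negatives, 40 positives.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((100, 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fitting.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

A model is only informative if it beats this baseline, which equals the share of the majority class (0.60 for the toy labels).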