## Part 1: Linear regression

In [1]:
# Import required libraries. No other libraries are required for this task.
import pandas as pd
import numpy as np
from sklearn import linear_model

### Question 1 -  Import CSV into pandas

1. Create a function to read the csv file provided into a DataFrame. 
3. First step in processing data is to list the data types of the columns. 
4. Use **pandas** features *columns* and *dtypes* to create a dictionary with column names as keys and the datatype as values.
5. This function then returns the new dataframe (df) and the df_types dictionary (df_types), where a key-value pair represents column name-column's dtype

In [None]:
def process_data(fl):
    
    # Import the CSV file (fl)
    df = pd.read_csv(fl)
    
    # Create a dictionary with keys the column names and values the type of data
    df_types = {}
    
    for col in df.columns:
        df_types[col] = df[col].dtype
        
    return df, df_types

### Question 2 - Splitting data

1. Split the data into 2 dataframes: called *df_train* and *df_test*. 
2. Use **pandas** DataFrame.sample to pick around 75% randomly as the training dataframe. 
3. Put the rest in test dataframe. Use DataFrame.drop() function on the full dataframe to drop the entries in *df_train*.
4. Do NOT use methods from **sklearn**. 


In [None]:
def train_test_split(df):
    df_train = None
    df_test = None
    
    # Assign 75% of input data(df) to df_train and the rest to df_test
    
    df_train = df.sample(frac=0.75)
    df_test = df.drop(df_train.index)
    
    return df_train, df_test

### Question 3 - Scaling data

1. In the dataframe each column is a feature. In the real estate data there are 8 features. The first column (number 0) is just an index number - ignore it. We will only consider 7 features (1-7). 
2. These are all of different orders of magnitude. For example, the "transaction date" is in thousands (very high value) but the "number of ...stores" is in one or two digits (low). So we scale them to be more consistent, otherwise transaction date could dominate the predicted outcome of the regression model.
3. Find the *maximum* ($M$), *minimum* ($m$) and *mean* ($av$) of each column. Each entry $x_i$ is scaled as:

$$ x_i \rightarrow \frac{x_i -av}{M-m}$$

4. We will apply scaling to the *numpy* arrays. 
5. In the function below the input feature matrix is $X\_in$. 
6. You may create a helper function. Check the numpy functions.

In [4]:
def scale_features(df):
    
    #the feature vectors as a matrix
    X_in = np.array(df.iloc[:,1:7])
    
    #the output vector
    y = np.array(df.iloc[:, 7])
    
    #a matrix of same shape as X_in with all zeros
    X_scaled = np.zeros(X_in.shape)
    pass

    #apply scaling to each column of X_in separately and store them in X_scaled 
    for i in range(X_in.shape[1]):
        X_scaled[:,i] = (X_in[:,i] - X_in[:,i].mean())/(X_in[:,i].max() - X_in[:,i].min())
    
    return X_scaled, y


### Question 4 - Model training

We are now ready to build the linear regression model. 
1. We use the **sklearn** [linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) class to build the model. 
2. Answer the questions that follow. 
  

In [None]:
#4 marks
def fit_linearModel(X, y):

    # return the LinearRegression() estimator that has been fitted, so that
    # it can be used for the next question
    
    # Your code goes here
    linmodel_realest = linear_model.LinearRegression().fit(X, y)
    
    return linmodel_realest

In [6]:
# Import data
df, df_types = process_data("Real_estate_data_for part_ONE.csv")

# Train test split
df_train, df_test = train_test_split(df)

# Scale the training data
X_scaled, y = scale_features(df_train)

# Train model
linmodel_realest = fit_linearModel(X_scaled, y)

# Get coeficients
print(linmodel_realest.coef_)

# Get intercept
print(linmodel_realest.intercept_)

[  5.60262607 -11.47694919 -29.32875796  10.40237297  20.61405328
  -1.68558322]
37.486774193548015


In [7]:
# Run the model 5 times with different training sets
for i in range(5):
    
    # Train test split (every repetition gets different datasets)
    df_train, df_test = train_test_split(df)
    
    # Get the training data scaled
    X_scaled, y = scale_features(df_train)
    
    # Train the model 
    linmodel_realest = fit_linearModel(X_scaled, y)
    
    # Get coefficients
    print(linmodel_realest.coef_)
    
    # Get intercept
    print(linmodel_realest.intercept_)

[  5.32358509 -12.97158811 -28.15098444  10.23168245  18.92680948
  -2.15219871]
38.344516129031135
[  4.71406968 -12.02453915 -26.4566264   13.22445595  20.82562611
  -1.83452947]
37.15741935484101
[  4.68987847 -12.01646868 -34.78219122  11.06351945  14.06002904
  -4.6879378 ]
38.219999999999615
[  5.07662423 -12.62305316 -26.7688618   13.57717154  16.1448583
  -0.16292757]
37.2109677419355
[  4.94855113 -10.01230455 -26.31828498  13.03571379  18.87453794
   1.16075681]
38.03677419354847


### Question 5 - Model evaluation

Now we use the test data to check the accuracy of the model. We will use root-mean-square error (RMSE) to test accuracy.

1. RMSE is the square-root of the average of the squared errors between the predicted and observed value. 
2. In the following function you will find the RMSE for the fitted model. 
3. You should use the returned LinearRegression() object that is return by the function *fit_linearModel* above.
4. You should write the RMSE function yourself. Do NOT use **sklearn** *score*() method. However you may use the *predict*() method. 
5. Test for accuracy on 5 different train-test sets and report the average RMSE value. Write a few comments on how to improve accuracy of prediction. 

In [None]:
#X and y correspond to the test data and model is the output of fit_linearModel()
def check_rmse(model, X, y):
    rmse = 0

    # Update the variable rmse
    y_preds = model.predict(X)
    rmse = np.sqrt(sum((y_preds-y)**2)/(y.shape[0]))
    
    return rmse

In [9]:
# Import the data 
df, df_types = process_data("Real_estate.csv")

# Initialize list with scores
scores = []

# Run the model 5 times with different train and test datasets
for i in range(5):
    
    # Train test split (every repetition gets different datasets)
    df_train, df_test = train_test_split(df)
    
    # Scale train data
    X_train_scaled, y_train = scale_features(df_train)
    
    # Scale test data
    X_test_scaled, y_test = scale_features(df_train)
    
    # Fit model
    linmodel_realest = fit_linearModel(X_scaled, y)
    
    # Evaluate model
    rmse = check_rmse(linmodel_realest, X_test_scaled, y_test)
    
    # Save all evaluations 
    scores.append(rmse)

# Print the list of scores
print("Scores: ", scores)

# Print average of scores
print("Mean score: ", np.mean(scores))

Scores:  [9.053993457009698, 8.001288823097477, 8.736516725876907, 8.191975860902698, 8.957096288378471]
Mean score:  8.58817423105305


### Comments on how to improve accuracy

- Try to use different sizes for the train and test data sets.
- Review the independence and correlation within the variables.
- Select only relevant variables for the regression. These are the variables more correlated with the target variable, the variables with less correlation with the target variable can be ignored.
- Try different methods for scale and normalize the data.
