## Tutorial 2

1. Open your terminal or command prompt.
2. Activate your conda environment:
   conda activate MLLab
3. Install Matplotlib:
   conda install matplotlib
4. Verify the installation by running the following in Python:
   import matplotlib.pyplot as plt
   print(plt.matplotlib.__version__)

After installation, it is good practice to verify that `matplotlib` was installed successfully. The first line imports `matplotlib.pyplot` as `plt`; the second reaches the `matplotlib` package through the `plt` object and prints its `__version__` attribute. If a version number is printed, the installation was successful.

5. Install statsmodels using:
   pip install statsmodels
 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
print(plt.matplotlib.__version__)
#%matplotlib inline

Pandas is built on the NumPy package, and its key data structure is the DataFrame.
A DataFrame lets you store and manipulate tabular data in rows of observations and columns of variables.

Let's create a DataFrame from a dictionary, Python's built-in data type for storing collections of data as key-value pairs. Each key becomes a column in the DataFrame, and each value (here a list) holds the data entries for that column.

In [None]:
data_example = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}

df = pd.DataFrame(data_example)


Display the DataFrame:

In [None]:
df


#### Select Data
Select the 'Name' and 'Age' columns:


In [None]:
df[['Name', 'Age']]


#### Compute Basic Statistics
Compute basic statistics for the numerical columns:

In [None]:
df.describe()


#### Query the DataFrame
Query the DataFrame for people older than 30:

In [None]:
df[df['Age'] > 30]


This example demonstrates the ease of data manipulation with pandas: creating a DataFrame, selecting specific columns, computing statistics, and filtering data based on conditions. Pandas is an extensive library that supports a wide range of data manipulation and analysis tasks.
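
As a small illustration of combining these steps, the sketch below filters rows and selects columns in a single call with `.loc`. It rebuilds a copy of the example data under the hypothetical name `df_demo` so that it runs on its own; this particular combination is an addition for illustration and is not used elsewhere in the tutorial.

```python
import pandas as pd

# Copy of the example data so the snippet is self-contained
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
})

# Keep rows where Age > 30 and only the Name and Age columns
older = df_demo.loc[df_demo['Age'] > 30, ['Name', 'Age']]
print(older)
```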

#### Communities and Crime Dataset

The dataset used is the Communities and Crime data from https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime. The attribute to be predicted is per capita violent crimes (`ViolentCrimesPerPop`). The variables in the dataset describe the community, such as the percent of the population considered urban and the median family income, and law enforcement, such as the per capita number of police officers and the percent of officers assigned to drug units.

#### attributes.csv contains the column (variable) names

In [1]:
attrib = pd.read_csv('/Users/sitani/Documents/HertieSchool/Tutorials/communitiesandcrime/attributes.csv', delim_whitespace = True)


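`delim_whitespace=True` tells `read_csv` to treat any run of whitespace as the column separator, so each line is split into columns on spaces or tabs. Since pandas 2.2 this argument is deprecated in favor of the equivalent `sep=r'\s+'`.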
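
To see the splitting behavior in isolation, here is a minimal sketch on a whitespace-separated snippet. The actual layout of attributes.csv is not shown in this tutorial, so the contents below are made up for illustration; only the column name `attributes` is taken from the code that follows. The sketch passes `sep=r'\s+'`, which splits on runs of whitespace.

```python
import io
import pandas as pd

# Hypothetical file contents: columns separated by one or more spaces
text = """attributes     type
state          numeric
communityname  string
"""

# Split each line on runs of whitespace
demo = pd.read_csv(io.StringIO(text), sep=r'\s+')
print(demo.columns.tolist())
print(demo.shape)
```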

Read the communities.data CSV file into a pandas DataFrame and use the `attributes` column of the `attrib` DataFrame as the column names of the new DataFrame.

In [None]:
data = pd.read_csv('/Users/sitani/Documents/HertieSchool/Tutorials/communitiesandcrime/communities.data', names = attrib['attributes'])

In [None]:
print(data.shape)

The data has 1994 samples (rows) and 128 columns (variable names).

In [None]:
data.head()

#### Remove non-predictive features

1. state: US state (by number) - not counted as predictive above, but if considered, should be considered nominal (nominal)

2. county: numeric code for county - not predictive, and many missing values (numeric)

3. community: numeric code for community - not predictive and many missing values (numeric)

4. communityname: community name - not predictive - for information only (string)

5. fold: fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive (numeric)

In [None]:
data = data.drop(columns=['state', 'county',
                          'community', 'communityname',
                          'fold'])
data.head()

In [None]:
print(data.shape)

Now the data has 123 columns: 122 predictive features plus the target variable.

#### Checking for Missing Data
Missing values in the dataset are marked with `?`; replace them with NaN so that pandas recognizes them as missing.

In [None]:
data = data.replace('?', np.nan)
data.head()
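
One side effect worth checking: a column that contained `?` is typically read as text, and replacing `?` with NaN does not by itself turn it into a numeric column. The sketch below shows this on a made-up two-column frame (the name `toy` and its values are hypothetical) and converts the column with `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which '?' marks a missing value
toy = pd.DataFrame({'a': ['0.5', '?', '0.7'], 'b': [1.0, 2.0, 3.0]})

# Mark the missing entries
toy = toy.replace('?', np.nan)
print(toy.isnull().sum())   # number of missing values per column

# Convert the text column to floating-point numbers
toy['a'] = pd.to_numeric(toy['a'])
print(toy.dtypes)
```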


In [None]:
feat_miss = data.columns[data.isnull().any()]

print(feat_miss)
feat_miss.shape

##### Of the 122 predictive features, 23 contain missing values.

In [None]:
# Look at the features with missing values

data[feat_miss[0:13]].describe()

In [None]:
data[feat_miss[13:23]].describe()

#### Only OtherPerCap has a single missing value; the rest have many missing values.
The missing value in the feature OtherPerCap will be filled with the column mean using the SimpleImputer class from sklearn.

The other features have many missing values, and for simplicity's sake we will remove them from the data set.

The SimpleImputer class provides basic strategies for imputing (filling in) missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located. This class also allows for different missing values encodings.

For more info: https://scikit-learn.org/stable/modules/impute.html
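
A minimal sketch of two of these strategies on a made-up single-column array (the names `X_toy`, `mean_imp`, and `median_imp` are hypothetical): the missing entry is replaced by the mean or the median of the observed values in that column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with a single missing entry; observed values are 1, 3 and 8
X_toy = np.array([[1.0], [np.nan], [3.0], [8.0]])

mean_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
median_imp = SimpleImputer(missing_values=np.nan, strategy='median')

filled_mean = mean_imp.fit_transform(X_toy)      # NaN replaced by the column mean
filled_median = median_imp.fit_transform(X_toy)  # NaN replaced by the column median

print(filled_mean.ravel())
print(filled_median.ravel())
```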

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
# Create an instance of SimpleImputer with mean strategy
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer on the data and transform the column in one step
data['OtherPerCap'] = imputer.fit_transform(data[['OtherPerCap']])


In [None]:
data = data.dropna(axis=1)
print(data.shape)
data.head()

Now the data has 101 columns: 100 predictive features plus the target variable.

### Summary Statistics of a Dataset:


count: The number of non-missing (non-NaN) values.

mean: The mean of the values.

std: The standard deviation of the values.

min: The minimum value.

25%: The 25th percentile (first quartile).

50% (median): The median of the data.

75%: The 75th percentile (third quartile).

max: The maximum value.

In [None]:
data.describe()

To specify percentiles other than the defaults (25, 50, and 75), pass them through the `percentiles` argument:

In [None]:
custom_percentiles = data.describe(percentiles=[.20, .40, .60, .80])
print(custom_percentiles)

In [None]:
import matplotlib.pyplot as plt
# https://seaborn.pydata.org/
# https://matplotlib.org/
# ViolentCrimesPerPop is the output variable in the dataset
plt.hist(data['ViolentCrimesPerPop'], bins=30, alpha=0.7, color='red')
plt.title('Distribution of Per Capita Violent Crimes')
plt.xlabel('Violent Crimes per 100K Population (normalized to 0-1)')
plt.ylabel('Frequency')
plt.show()


### Splitting the data into training and test sets

In [None]:
X = data.iloc[:, 0:100].values #(data)
y = data.iloc[:, 100].values  #(the attribute/feature to be predicted)

The `.iloc` indexer in pandas selects rows and columns by integer position.
Here, `data.iloc[:, 0:100]` takes all rows and the first 100 columns (the features), `data.iloc[:, 100]` takes the last column (the target), and `.values` converts each result to a NumPy array.
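
The same pattern on a made-up three-column frame (the name `toy` and its values are hypothetical), where the last column plays the role of the target:

```python
import pandas as pd

# Two feature columns followed by one target column
toy = pd.DataFrame({'f1': [1, 2, 3], 'f2': [4, 5, 6], 'target': [0, 1, 0]})

X_toy = toy.iloc[:, 0:2].values   # all rows, columns 0 and 1 (the end index 2 is excluded)
y_toy = toy.iloc[:, 2].values     # all rows, column 2 only

print(X_toy.shape)
print(y_toy.shape)
```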

1. Importing train_test_split Function: The **from sklearn.model_selection import train_test_split** command imports the train_test_split function from scikit-learn, which is used to split the dataset into training and test sets.

2. Setting the Random Seed: **seed = 0** sets the seed for the random number generator to 0. This ensures that the results are reproducible; anyone running this code with the same dataset and seed will get the same split of data.

3. Splitting the Dataset: **X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = seed)** splits the features (X) and the target variable (y) into training and test sets. 30% (test_size = 0.3) of the data is allocated to the test set, while the remaining 70% is used for training the model.
The random_state = seed parameter ensures that the split is reproducible.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
seed = 0

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = seed)

print(X.shape)
print(y.shape)
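
The cell above prints the shapes of the full arrays. To inspect the split itself, and to check that a fixed seed reproduces it, a sketch on made-up data (all names below are hypothetical) could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# 30% of the samples go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
print(np.array_equal(X_te, X_te2))
```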

### Standardization:
Standardization refers to the process of transforming each feature in your data so that it has a mean of 0 and a standard deviation of 1. This is done by subtracting the mean of each feature and then dividing by the standard deviation for each feature. The formula used is:

z = (x - μ) / σ

Here, x is the original feature value, μ is the mean of the feature, and σ is the standard deviation of the feature.
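
A worked example of the formula on a made-up feature with four values, using plain NumPy. Note that `x.std()` computes the population standard deviation (dividing by n), which is also the convention `StandardScaler` uses.

```python
import numpy as np

# A single toy feature
x = np.array([2.0, 4.0, 6.0, 8.0])

mu = x.mean()     # mean of the feature
sigma = x.std()   # population standard deviation (ddof=0)

z = (x - mu) / sigma
print(z)
print(z.mean(), z.std())
```

After the transformation the feature has mean 0 and standard deviation 1, up to floating-point rounding.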



1. Importing StandardScaler: **from sklearn.preprocessing import StandardScaler** imports the StandardScaler class, which provides the functionality to standardize features.

2. Creating a StandardScaler Instance: **sc = StandardScaler()** creates an instance of StandardScaler. This instance will be used to compute the mean and standard deviation for each feature in the dataset.
   
3. Fitting and Transforming the Training Data: **X_train = sc.fit_transform(X_train)** computes the mean and standard deviation of each feature in the training set X_train, and then standardizes the training set by applying the transformation z=(x-μ)/σ.

   The fit_transform method is a combination of fit (to compute the scaling parameters) and transform (to apply the standardization). The standardized training data is then reassigned to X_train.

4. Transforming the Testing Data: **X_test = sc.transform(X_test)** applies the same transformation to X_test using the mean and standard deviation calculated from the training set. It's crucial to use the parameters from the training set to ensure the model evaluates on the same scale. The standardized test data is reassigned to X_test.
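
To make step 4 concrete, the sketch below uses made-up one-feature arrays (all names are hypothetical). The scaler stores the training mean in `mean_`, and because the test set is scaled with the training statistics, its own mean after the transformation is generally not 0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test sets with one feature
X_tr = np.array([[1.0], [2.0], [3.0]])
X_te = np.array([[4.0], [5.0]])

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)   # learns mean and std from the training data
X_te_s = scaler.transform(X_te)       # reuses the training mean and std

print(scaler.mean_)          # mean learned from the training set
print(X_tr_s.mean(axis=0))   # approximately 0 for the training set
print(X_te_s.mean(axis=0))   # not 0: the test set was scaled with training statistics
```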


In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize features by removing the mean and scaling to unit variance

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_train)
print(X_test)

Calculate Coefficients of Ordinary Least Squares Regression:
1. Define a function that estimates the coefficients by performing the matrix operations of the closed-form solution.
2. Estimate the coefficients using statsmodels.
3. Compare the coefficients calculated by the two methods.
4. Calculate the predictions from the two methods and find the mean squared error (MSE) between the actual values of the output variable and your predictions.

### Ordinary Least Squares Regression using the closed form estimate:
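
For reference, the closed-form estimate minimizes the sum of squared residuals. The notation here is introduced for illustration: $X$ is assumed to be the training feature matrix with a leading column of ones for the intercept, and $y$ the vector of training targets.

$$
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 = (X^\top X)^{-1} X^\top y
$$

The expression is valid when $X^\top X$ is invertible. Predictions for new data follow as $\hat{y} = X_{\text{test}}\,\hat{\beta}$, where $X_{\text{test}}$ also carries the leading column of ones.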

Let's define a function to do this. 

In [None]:
def linear_regression_closed_form(X_train, y_train, X_test=None):
    """
    Performs linear regression using the closed-form solution.

    Parameters:
    - X_train: Training features, numpy array of shape (n_samples, n_features)
    - y_train: Training target, numpy array of shape (n_samples,)
    - X_test: Optional, test features, numpy array of shape (n_samples_test, n_features)

    Returns:
    - beta: Coefficients estimated from the training data
    - predictions: Optional, predictions made on the test data if X_test is provided
    """
    
    
    # Add intercept term to training and optionally to test data
    X_train_with_intercept = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
    if X_test is not None:
        X_test_with_intercept = np.hstack([np.ones((X_test.shape[0], 1)), X_test])
    
    # Calculate coefficients using the closed-form solution: beta = (X^T X)^(-1) X^T y
    beta = np.linalg.inv(X_train_with_intercept.T @ X_train_with_intercept) @ X_train_with_intercept.T @ y_train
    
    # Make predictions on the test set if provided
    predictions = None
    if X_test is not None:
        predictions = X_test_with_intercept @ beta
    
    return beta, predictions
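
As a sanity check of the matrix formula, the sketch below applies it to made-up data generated from y = 1 + 2x, so the expected coefficients are known in advance. It also computes the same estimate with `np.linalg.lstsq`, which avoids forming an explicit inverse and is often preferred numerically; that alternative is an addition here, not part of the function above. All names are hypothetical.

```python
import numpy as np

# Toy data generated from y = 1 + 2*x, so intercept 1 and slope 2 are expected
x = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x

# Design matrix with a leading column of ones for the intercept
X_design = np.column_stack([np.ones_like(x), x])

# Normal equations with an explicit inverse
beta_inv = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y_toy

# Least-squares solver that does not form the inverse
beta_lstsq, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)

print(beta_inv)
print(beta_lstsq)
```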


In [None]:

beta, predictions_closedform = linear_regression_closed_form(X_train, y_train, X_test)

print("Estimated coefficients:", beta)
print(beta.shape)

In [None]:
if predictions_closedform is not None:
    print("Predictions on test set using closed form solution:", predictions_closedform)

### Calculate Coefficients using Statsmodels

In [None]:
import statsmodels.api as sm

In [None]:
# Add intercept term for statsmodels
X_train_sm = sm.add_constant(X_train)
X_test_sm = sm.add_constant(X_test)

# Fit the model
model = sm.OLS(y_train, X_train_sm).fit()


Make Predictions on Test Set

In [None]:
# Predictions using statsmodels
predictions_statsmodels = model.predict(X_test_sm)

Calculate MSE for Both Models


In [None]:
from sklearn.metrics import mean_squared_error

mse_numpy = mean_squared_error(y_test, predictions_closedform)
mse_statsmodels = mean_squared_error(y_test, predictions_statsmodels)

print("MSE for closed-form solution:", mse_numpy)
print("MSE for statsmodels:", mse_statsmodels)
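
The MSE is the average of the squared differences between actual and predicted values. A small sketch on made-up numbers (names are hypothetical) computes it by hand and compares the result with `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy actual values and predictions; only the last prediction is off, by 2
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])

# Average of the squared errors: (0 + 0 + 4) / 3
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```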


Compare the Coefficients

In [None]:
print("Coefficients from closed-form solution:", beta)
print("Coefficients from statsmodels:", model.params)


In [None]:
# beta is from the closed-form solution; model.params is from statsmodels
difference = beta - np.asarray(model.params)  # np.asarray ensures a plain NumPy array for the subtraction
print("Coefficient Differences:", difference)
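
Two correct implementations rarely agree to the last digit, so a tolerance-based check is often easier to read than a long vector of tiny differences. The sketch below uses made-up coefficient vectors standing in for the two estimates (the names `coef_a` and `coef_b` are hypothetical).

```python
import numpy as np

# Hypothetical coefficient vectors standing in for the two estimates
coef_a = np.array([0.24, -0.10, 0.05])
coef_b = coef_a + 1e-12   # tiny floating-point discrepancy

# Largest absolute difference between the two vectors
max_diff = np.max(np.abs(coef_a - coef_b))
print(max_diff)

# np.allclose reports whether all entries agree within a tolerance
print(np.allclose(coef_a, coef_b))
```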

### Visual Comparison

Line Plot: To compare how closely each coefficient from the two methods aligns, a line plot can be effective.

Recall that we imported matplotlib.pyplot as plt at the beginning.
 

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(beta, label='Closed-Form', marker='o')
plt.plot(model.params, label='Statsmodels', marker='x')
plt.ylabel('Coefficient Value')
plt.xlabel('Coefficient Index')
plt.title('Comparison of Coefficients')
plt.legend()
plt.grid(True)
plt.show()
