# Introduction to data science - Assignment #3
The ‘crime.csv’, ‘kidney disease.csv’, and ‘email.csv’ data files attached to the assignment file (available on Sakai) are taken from the UCI repository [1] and the Kaggle website [2, 3]. The Crime Data file reports the number of violent crimes per 100,000 population for communities within the United States, along with some socio-economic factors. The variables are as follows:
* ‘PctPopUnderPov’: Percentage of people under the poverty level (numeric: from 0 to 1)
* ‘PctUnemployed’: Percentage of unemployed people (numeric: from 0 to 1)
* ‘PolicPerPop’: Ratio of police officers to the population (numeric: from 0 to 1)
* ‘Pcthomeless’: Percentage of homeless people (numeric: from 0 to 1)
* ‘PctBSorMore’: Percentage of people with a bachelor’s degree or higher education (numeric: from 0 to 1)
* ‘ViolentCrimesPerPop’: Ratio of violent crimes to the population (numeric: from 0 to 1)

The Kidney Disease Data reports the age, blood pressure, and blood test results for healthy people and kidney patients. It includes the following 9 variables:
* ‘age’: Age of the individual (numeric: from 2 to 90)
* ‘bp’: Blood pressure (numeric: from 50 to 180)
* ‘sod’: Blood sodium level test result (numeric: from 104 to 163)
* ‘pot’: Blood potassium level test result (numeric: from 2.5 to 47)
* ‘hemo’: Hemoglobin blood test result (numeric: from 3.1 to 17.8)
* ‘pcv’: Packed cell volume test result (numeric: from 9 to 54)
* ‘wc’: White blood cell test result (numeric: from 2200 to 26400)
* ‘rc’: Red blood cell test result (numeric: from 2.1 to 8)
* ‘CKD’: Chronic kidney disease (binary: 0 for healthy individuals and 1 for kidney patients)

The Email Data contains the text of numerous emails, each labeled as spam or not spam. It includes the following columns:
* ‘email’: Text of the email (string)
* ‘label’: Label of the email (binary: 1 for spam and 0 for not-spam)

**Note.** You must put the CSV files in the same folder as your code file. If you use Jupyter Notebook, the files should be placed in ‘C:/Users/YOURUSER-NAME’. You can also read a file by its full path; for example:
```Python
f = open('C:/files/sample-file.txt')
```
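
The solutions below read the data with pandas, and `pd.read_csv` accepts a full path in the same way. The following is only a minimal sketch: the file name `sample-file.csv`, the temporary folder, and the two rows of values are made up so that the example runs without the assignment data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a data file; the folder and the values are made up
folder = tempfile.mkdtemp()
csv_path = os.path.join(folder, 'sample-file.csv')
with open(csv_path, 'w') as f:
    f.write('PctUnemployed,ViolentCrimesPerPop\n0.27,0.20\n0.36,0.67\n')

# Reading the file by its full path instead of relying on the working folder
sample = pd.read_csv(csv_path)
print(sample.shape)
```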

## Question 1
Write code to learn a simple regression model that predicts the ratio of violent crimes based on (i) the percentage of unemployed people and (ii) the percentage of people with a bachelor’s degree or higher education. Then explain the impact of each of these two factors on violent crimes by interpreting the regression coefficients.

In [1]:
# Importing packages
from sklearn.linear_model import LinearRegression
import pandas as pd

# Reading data
data = pd.read_csv('crime.csv')

# Assigning features to variables
violentCrimesPerPop = data['ViolentCrimesPerPop']
pctUnemployed = data[['PctUnemployed']]
pctBSorMore = data[['PctBSorMore']]

# Fitting data to linear regression model
model1 = LinearRegression().fit(pctUnemployed, violentCrimesPerPop)
print("For pctUnemployed:")
print("The linear regression coefficient is", model1.coef_) # print linear regression coefficient
print("The linear regression intercept is", model1.intercept_,"\n") # print linear regression intercept

model2 = LinearRegression().fit(pctBSorMore, violentCrimesPerPop)
print("For pctBSorMore:")
print("The linear regression coefficient is", model2.coef_) # print linear regression coefficient
print("The linear regression intercept is", model2.intercept_) # print linear regression intercept

For pctUnemployed:
The linear regression coefficient is [0.79416959]
The linear regression intercept is 0.08767372551145558 

For pctBSorMore:
The linear regression coefficient is [-0.4617803]
The linear regression intercept is 0.6066504351459001
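
One way to read these numbers, sketched below using only the rounded coefficients and intercepts printed above: in a simple linear regression, the coefficient is the change in the predicted value per unit change in the predictor. The helper `predict` and the example inputs 0.2 and 0.3 are illustrative assumptions, not part of the original solution.

```python
# Coefficients and intercepts copied (rounded) from the output above
coef_unemployed, intercept_unemployed = 0.79416959, 0.08767373
coef_bs, intercept_bs = -0.4617803, 0.60665044


def predict(x, coef, intercept):
    """Prediction of a simple linear regression: intercept + coef * x."""
    return intercept + coef * x


# Effect of raising each predictor from 0.2 to 0.3 (inputs chosen for illustration)
change_unemployed = (predict(0.3, coef_unemployed, intercept_unemployed)
                     - predict(0.2, coef_unemployed, intercept_unemployed))
change_bs = predict(0.3, coef_bs, intercept_bs) - predict(0.2, coef_bs, intercept_bs)

print(round(change_unemployed, 4))  # 0.0794
print(round(change_bs, 4))          # -0.0462
```

Under these fits, higher unemployment is associated with a higher predicted violent crime ratio, while a larger share of people with a bachelor's degree or more is associated with a lower one.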


## Question 2
Write code to learn a multiple regression model that predicts the ratio of violent crimes based on all the other variables. Report the most influential factor in violent crimes.

In [2]:
# Importing packages
from sklearn.linear_model import LinearRegression
import pandas as pd

# Reading data & assigning features
data = pd.read_csv('crime.csv')
X = data.iloc[:,0:5] #independent, predictors
y = data['ViolentCrimesPerPop'] #dependent, predicted

# Fitting data to linear regression model
model = LinearRegression().fit(X, y)
print("The linear regression coefficients are", model.coef_)
print("The linear regression intercept is", model.intercept_)


The linear regression coefficients are [ 0.56906083  0.21041265  0.19527306  0.21295658 -0.06307632]
The linear regression intercept is 0.05026064752166082
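
To report the most influential factor, the coefficients can be paired with their feature names and ranked by absolute value. The sketch below reuses the coefficients printed above; it assumes that the first five columns of the CSV follow the order of the variable list in the introduction, and that comparing raw coefficients is reasonable because every predictor is described as lying between 0 and 1.

```python
import pandas as pd

# Coefficients copied from the output above
coefficients = [0.56906083, 0.21041265, 0.19527306, 0.21295658, -0.06307632]

# Assumption: the CSV columns follow the order of the variable list
feature_names = ['PctPopUnderPov', 'PctUnemployed', 'PolicPerPop',
                 'Pcthomeless', 'PctBSorMore']

coef_by_feature = pd.Series(coefficients, index=feature_names)

# Rank by absolute value so that strong negative effects are not overlooked
ranking = coef_by_feature.abs().sort_values(ascending=False)
most_influential = ranking.index[0]

print(ranking)
print(most_influential)
```

If the column-order assumption holds, the largest coefficient belongs to 'PctPopUnderPov'.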


## Question 3
Use the `LogisticRegression` class of the `sklearn` package to learn a logistic regression model that predicts chronic kidney disease based on the other variables in the Kidney Disease Data.

### Non-scaled Data
Using non-scaled data produces a convergence warning: the solver reaches its iteration limit.

In [3]:
# Importing packages
from sklearn.linear_model import LogisticRegression
import pandas as pd


# Reading data & assigning features
data = pd.read_csv('kidney_disease.csv')
X = data.iloc[:,0:8] #independent, predictors
y = data['ckd'] #dependent, predicted

# Fitting data to logistic regression model
model = LogisticRegression().fit(X,y)
print(model.coef_)
print(model.intercept_)


[[ 5.94233006e-02  1.73231100e-01  6.83825724e-02 -7.04750167e-02
  -1.08041713e+00 -3.34047211e-01  4.49913239e-04 -3.46605088e-01]]
[0.03332329]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Scaled data
Standardizing the features before fitting avoids the warning.

In [21]:
# Importing packages
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import pandas as pd


# Reading data & assigning features
data = pd.read_csv('kidney_disease.csv')
y = data['ckd'] #dependent, predicted (left unscaled)

# Standardizing the eight predictor columns (zero mean, unit variance)
X = StandardScaler().fit_transform(data.iloc[:,0:8]) #independent, predictors

# Fitting data to logistic regression model
model = LogisticRegression().fit(X,y)
print(model.coef_)
print(model.intercept_)

[[ 0.64074792  0.97968573 -1.17806381 -0.29715299 -1.95532593 -1.69544651
   0.84098867 -0.76690007]]
[-0.0579012]
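
Scaling and fitting can also be chained in a single estimator, so that the same scaling is applied automatically whenever the model is used. The following is a hedged sketch on synthetic data rather than the kidney data: the feature scales and the rule that generates the labels are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 rows, 3 features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y = (X[:, 0] + X[:, 1] / 100.0 > 0).astype(int)  # made-up labeling rule

# The pipeline standardizes the features, then fits the classifier
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X, y)

coef_shape = pipeline.named_steps['logisticregression'].coef_.shape
train_accuracy = pipeline.score(X, y)
print(coef_shape)
print(train_accuracy)
```

Calling `pipeline.predict` on new rows applies the fitted scaler first, which avoids having to scale the inputs by hand.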


## Question 4
Split the Kidney Disease Data into training data (70%) and testing data (30%), and train the model from Question 3 using the training data. Then predict chronic kidney disease for the testing data samples and report the accuracy and F1 score of the predictions.

In [4]:
# Importing packages
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import pandas as pd


# Reading data & assigning features
data = pd.read_csv('kidney_disease.csv')
X = data.iloc[:,0:8] #independent, predictors
y = data['ckd'] #dependent, predicted

# Splitting data into test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=123)

# Fitting split data to logistic regression model
model = LogisticRegression().fit(X_train, y_train)
print(model.coef_)
print(model.intercept_,"\n")

# Model evaluation (finding accuracy & f1 score)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred)) 
print(f1_score(y_test, y_pred))

[[ 1.11860181e-01  1.96943751e-01 -1.23387851e-02  5.08324523e-02
  -9.16030983e-01 -2.25580580e-01  3.98043961e-04 -4.36055811e-01]]
[0.03345608] 

0.9411764705882353
0.9130434782608695
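
For reference, both metrics can be derived from the counts of true/false positives and negatives. The sketch below uses eight made-up labels (not the kidney data) and checks the hand-computed values against `accuracy_score` and `f1_score`.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration (1 = kidney patient, 0 = healthy)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(y_true)             # share of correct predictions
precision = tp / (tp + fp)                     # correct among predicted positives
recall = tp / (tp + fn)                        # found among actual positives
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, accuracy_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```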


## Question 5
The following function takes a text as input and returns a dictionary that includes the frequency of each word in the text. Change this function to return the frequency ratio of the most frequent word to the length of the text.

```Python
def get_frequency(input_string):
    
    list_of_words = input_string.split(' ')
    dict_of_frequencies = {}
    
    for word in list_of_words:
        
        if word in dict_of_frequencies.keys():
            dict_of_frequencies[word] = dict_of_frequencies[word] + 1
        else:
            dict_of_frequencies[word] = 1
        
    return(dict_of_frequencies)

```

### Notes

Getting the dict key with the maximum value in Python:
https://datagy.io/python-get-dictionary-key-with-max-value/#:~:text=The%20simplest%20way%20to%20get,maximum%20value%20of%20any%20iterable.

`max()` returns the largest item; its `key=` argument specifies a function that is applied to each item, and `max()` compares the items by that function's result:
https://docs.python.org/3/library/functions.html#max
https://docs.python.org/3/library/stdtypes.html#list.sort

`get()` returns the value of a given dict key:
https://www.w3schools.com/python/ref_dictionary_get.asp

In this case, `get()` is used as the key for `max()`, so `max()` finds the maximum by dict value (and not by dict key).


**Filtering a dictionary with a dict comprehension; similar to the max value solution above:**
https://thispointer.com/python-filter-a-dictionary-by-conditions-on-keys-or-values/

`filter()`:
https://www.w3schools.com/python/ref_func_filter.asp
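
A small illustration of the `max()` and `get()` notes above, using a made-up dictionary of word counts:

```python
word_counts = {'hello': 2, 'world': 3, 'test': 1}  # made-up counts

# max() iterates over the keys and compares them by their dict values
most_frequent_word = max(word_counts, key=word_counts.get)

# max() over the values returns the highest count itself
highest_count = max(word_counts.values())

print(most_frequent_word)  # world
print(highest_count)       # 3
```

When several keys share the maximum value, `max()` returns the first one it encounters.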

In [5]:
#Defining function
def get_frequency(input_string):
    
    list_of_words = input_string.split(' ') #create list of words, delimited by spaces
    dict_of_frequencies = {} #create empty dict of frequencies
    word_count = 0 #create word_count variable, set to 0
    
    # finding frequency of each word in list
    for word in list_of_words: #for each word in list
        
        word_count = word_count + 1 #add to word count for each word in list
        
        if word in dict_of_frequencies.keys(): #if word already exists in dict
            dict_of_frequencies[word] = dict_of_frequencies[word] + 1 #add 1 to word's frequency
        else:
            dict_of_frequencies[word] = 1 #otherwise add word to dict, set frequency to 1
    
    # find frequency ratio of most frequent word(s)
    max_count = max(dict_of_frequencies.values()) #highest frequency, computed once
    dict_of_frequencies = { #write over dict_of_frequencies
        key:(value/word_count) #divide frequency by word count (i.e. find ratio)
        for (key,value) in dict_of_frequencies.items() #only for dict elements...
        if value == max_count #... which have frequency equal to the max frequency
    }
    
    return(dict_of_frequencies)

#Testing function
# 12 words total, 'world' and 'test' occur 3 times (proportion = 0.25) each
get_frequency("Hello hello Hello world world world! world Hello! World! test test test")

{'world': 0.25, 'test': 0.25}

In [6]:
# Simplified function for question 6

#Defining function
def get_frequency(input_string):
    
    list_of_words = input_string.split(' ') #create list of words, delimited by spaces
    dict_of_frequencies = {} #create empty dict of frequencies
    word_count = 0 #create word_count variable, set to 0
    
    # finding frequency of each word in list
    for word in list_of_words: #for each word in list
        
        word_count = word_count + 1 #add to word count for each word in list
        
        if word in dict_of_frequencies.keys(): #if word already exists in dict
            dict_of_frequencies[word] = dict_of_frequencies[word] + 1 #add 1 to word's frequency
        else:
            dict_of_frequencies[word] = 1 #otherwise add word to dict, set frequency to 1
    
    return( 
        max(dict_of_frequencies.values())/word_count # return frequency ratio of most frequent word(s)
    )

#Testing function
# 12 words total, 'world' and 'test' occur 3 times (ratio = 3/12 = 0.25) each
get_frequency("Hello hello Hello world world world! world Hello! World! test test test")


0.25
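
The same ratio can be computed more compactly with `collections.Counter` from the standard library. This is an alternative sketch, not the original solution; the name `get_frequency_counter` is hypothetical.

```python
from collections import Counter


def get_frequency_counter(input_string):
    """Ratio of the most frequent word's count to the total number of words."""
    words = input_string.split(' ')   # same space-delimited split as above
    counts = Counter(words)           # maps each word to its frequency
    return max(counts.values()) / len(words)


ratio = get_frequency_counter(
    "Hello hello Hello world world world! world Hello! World! test test test"
)
print(ratio)  # 0.25
```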

## Question 6
The following code is a template for extracting useful features from the email texts in the Email Data and training a model to predict whether an email is spam. Complete the code to:
* Extract four binary features representing the presence of the words 'hyperlink', 'free', 'click', and 'business' in email texts,
* Use the `get_frequency` function in Question 5 to extract the ratio of the most frequent word of the text as a numeric feature,
* Train a logistic regression model on 70% of the data to classify the email as spam or not spam based on the five extracted features,
* Predict the label of the remaining 30% of the data and report the accuracy of the predictions.

```Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = pd.read_csv('email.csv')

# adding empty columns
data['hyperlink'] = None
data['free'] = None
data['click'] = None
data['business'] = None
data['frequency'] = None

################## your code here ###################
## you need to
## 1. for each row
## 1-1. check if the mail text includes the words
## 'hyperlink', 'free', 'click', and 'business' and
## fill the corresponding columns with 0 or 1
## 1-2. Use the get_frequency function to get the ratio of
## the most frequent word and fill the frequency column
##
## 2. split the data into the training (70%) and testing
## (30%) data
##
## 3. Use Logistic Regression class of sklearn package
## to train a model to predict the label of emails
## based on the extracted features


#######################################################
```

### Notes
If conditions in a pandas DataFrame:
https://datatofish.com/if-condition-in-pandas-dataframe/
https://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns-apply-a-function-o

Lambda functions:
https://www.w3schools.com/python/python_lambda.asp

pandas `DataFrame.apply()`:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

In [7]:
# Importing packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Reading data
data = pd.read_csv('email.csv')

# Assign binary columns based on words in email
listOfWords = ('hyperlink','free','click','business') # set list of words to check for
for word in listOfWords: # for each word in the list
    data[word] = data['email'].apply( # create a column for the word, based on the 'email' column
        lambda email: # where for each email (row)
        1 if word in email # return 1 if the word is in the email
        else 0 # otherwise return 0
    )

# Define function to determine frequency ratio
def get_frequency(input_string):
    
    list_of_words = input_string.split(' ') #create list of words, delimited by spaces
    dict_of_frequencies = {} #create empty dict of frequencies
    word_count = 0 #create word_count variable, set to 0
    
    # finding frequency of each word in list
    for word in list_of_words: #for each word in list
        
        word_count = word_count + 1 #add to word count for each word in list
        
        if word in dict_of_frequencies.keys(): #if word already exists in dict
            dict_of_frequencies[word] = dict_of_frequencies[word] + 1 #add 1 to word's frequency
        else:
            dict_of_frequencies[word] = 1 #otherwise add word to dict, set frequency to 1
    
    return( 
        max(dict_of_frequencies.values())/word_count # return frequency ratio of most frequent word(s)
    )

# Assign frequency column based on ratio
data['frequency'] = data['email'].apply(
    lambda email: get_frequency(email)
)

# Assign features to independent/dependent
X = data.iloc[:, 2:7] #independent, predictors
y = data['label'] #dependent, predicted

# Splitting data into test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Fitting split data to logistic regression model
model = LogisticRegression().fit(X_train, y_train)
print(model.coef_)
print(model.intercept_,"\n")

# Model evaluation (finding accuracy & f1 score)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred)) 
print(f1_score(y_test, y_pred))


[[ 4.39106835  1.33194665  2.91680277  1.68421113 -0.09431901]]
[-3.03320865] 

0.9122222222222223
0.6721991701244813
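
The four binary keyword features can also be built without a `lambda`, using the pandas string accessor. This is a sketch on three made-up emails (not rows from 'email.csv'); like `word in email`, it matches substrings, so 'freedom' would also count as containing 'free'.

```python
import pandas as pd

# Made-up emails for illustration only
emails = pd.DataFrame({
    'email': ['click here for a free hyperlink',
              'meeting about business plans',
              'hello there'],
    'label': [1, 0, 0],
})

keywords = ['hyperlink', 'free', 'click', 'business']
for keyword in keywords:
    # Plain substring match; na=False treats missing email text as "not present"
    emails[keyword] = (
        emails['email'].str.contains(keyword, regex=False, na=False).astype(int)
    )

features = emails[keywords].values.tolist()
print(features)
```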


## References
1. Communities and crime data set. https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime
2. Chronic kidney disease dataset. https://www.kaggle.com/datasets/mansoordaku/ckdisease?resource=download
3. Spam or not spam dataset. https://www.kaggle.com/datasets/ozlerhakan/spam-or-not-spam-dataset