### GreenDS

# Fundamentals of Agro-Environmental Data Science

### Introduction

The purpose of this Jupyter Notebook exercise is to demonstrate the workflow of a data science project in this IDE. We will use a healthcare example that is widely publicised in the data science community.

You are not expected to understand the code being executed or the algorithms applied; you will be able to by the end of this first master's year. You should, however, be able to identify, at a high level, the different steps of the data science process:
- Formulate the question
- Obtain and prepare data
- Perform data exploration and analysis
- Develop the model
- Present the results

Another goal is to become familiar with the Jupyter Notebook interface: how to run code cells, render markdown cells, and solve problems caused by missing libraries.

# Examples of Data Science Use Cases

## Data Science in Healthcare - Predicting Breast Cancer

This example is based on the following Datacamp [Use Case](https://www.datacamp.com/blog/data-science-use-cases-guide) and Kaggle [example](https://www.kaggle.com/code/vincentlugat/breast-cancer-analysis-and-prediction).

The use case shows an exercise in predicting breast cancer. The dataset is based on measurements made on digitized images of biopsies (fine needle aspiration) of a breast mass. The attributes are as described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. Each case in the dataset is classified as benign or malignant, and this class is recorded in the Diagnosis attribute. The first two attributes of each record are:

- ID number
- Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.
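
The arithmetic behind the 30 features (ten measurements times three statistics) can be sketched in a few lines. The column spelling used here, such as `radius_mean` or `concave points_worst`, is an assumption modelled on the column names that appear later in this notebook; check it against the header of your own `data.csv`.

```python
# Ten base measurements, as listed above. The exact spelling (underscore in
# "fractal_dimension", space in "concave points") is an assumption.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Three statistics computed for every measurement
statistics = ["mean", "se", "worst"]

# All "mean" columns first, then "se", then "worst"
feature_columns = [
    f"{feature}_{stat}" for stat in statistics for feature in base_features
]

print(len(feature_columns))   # 30
print(feature_columns[:3])
```

With this ordering, positions 20 to 29 of the list hold the "worst" columns.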

## 1. Prepare your environment:
- Create a `raw-data` directory in your project's directory to hold external data files.
- Install and load the Python libraries needed to run the code. This project requires the following libraries:
  - pandas
  - scikit-learn (imported as `sklearn`)
  - seaborn
  - matplotlib
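
Both preparation steps can also be checked from Python. The following is a minimal sketch using only the standard library; the helper names `missing_packages` and `ensure_data_dir` are hypothetical and not part of any library.

```python
from importlib.util import find_spec
from pathlib import Path

# Import names, not PyPI names: scikit-learn is imported as "sklearn"
REQUIRED = ["pandas", "sklearn", "seaborn", "matplotlib"]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


def ensure_data_dir(path="raw-data"):
    """Create the data directory if it does not exist yet and return it."""
    data_dir = Path(path)
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir


data_dir = ensure_data_dir()
print(f"Data directory: {data_dir.resolve()}")
print(f"Missing packages: {missing_packages(REQUIRED) or 'none'}")
```

Any name reported as missing has to be installed before the corresponding `import` statement will work.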

In [None]:
# If you don't have these libraries installed, you can install them from the
# shell terminal with the following commands:
#
# $ pip3 install pandas
# $ pip3 install scikit-learn
# $ pip3 install seaborn

In [None]:
# import pandas library
import pandas as pd

## 2. Download the data file from Kaggle
Go to http://www.kaggle.com/uciml/breast-cancer-wisconsin-data and download the `data.csv` file. Place the file in the `raw-data` directory.

## 3. Read and preview data
Read the data file, and print the shape and a preview of the table:
- number of rows
- number of properties (columns)
- show the first two rows of data

In [None]:
cancer_data = pd.read_csv('./raw-data/data.csv')
pd.options.display.max_columns = cancer_data.shape[1]  # show every column in previews
print(f'Number of entries: {cancer_data.shape[0]:,}\n'
      f'Number of features: {cancer_data.shape[1]:,}\n\n'
      f'Number of missing values: {cancer_data.isnull().sum().sum()}\n\n'
      f'{cancer_data.head(2)}')

The table was loaded into a `pandas` DataFrame, which can show a preview of the data (`head`):

In [None]:
cancer_data.head()

## 4. Clean and explore data

You can scroll to the last column of the table above and verify that it contains only missing values (NaN). We need to remove this empty column:

In [None]:
cancer_data = cancer_data.drop('Unnamed: 32', axis=1)
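
Instead of scrolling, empty columns can also be found programmatically. The sketch below uses a tiny invented table (the numbers are made up for illustration) whose last column is entirely missing, mimicking the situation above.

```python
import numpy as np
import pandas as pd

# Invented three-row table; only the structure matters here
df = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.9, 20.5, 11.4],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing
empty_columns = df.columns[df.isna().all()].tolist()
print(empty_columns)

# Drop every column that contains nothing but missing values
cleaned = df.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

`dropna(axis=1, how="all")` removes a column only when all of its values are missing, so columns with occasional gaps are kept.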

It is possible to calculate descriptive statistics for the attributes of the data set:

In [None]:
cancer_data.describe()

Next, let's count how many women have a confirmed cancer (a malignant breast tumor):

In [None]:
cancer_data['diagnosis'].value_counts()

We can calculate these values as percentages:

In [None]:
round(cancer_data['diagnosis'].value_counts()*100/len(cancer_data)).convert_dtypes()
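
An equivalent way to obtain percentages is to let `value_counts` compute the fractions itself with `normalize=True`. This is a sketch on an invented five-element series, not on the real data.

```python
import pandas as pd

# Invented labels: three benign (B) and two malignant (M) cases
diagnosis = pd.Series(["B", "M", "B", "B", "M"], name="diagnosis")

# normalize=True returns fractions that sum to 1; scale them to percentages
percentages = (diagnosis.value_counts(normalize=True) * 100).round(1)
print(percentages)
```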

## 5. Visualize data

We can gain better insight into the data by comparing values for benign and malignant cases. Seaborn is a powerful library for visualizing data.

In [None]:
import seaborn as sns
sns.set(style="ticks", color_codes=True)

In [None]:
radius = cancer_data[['radius_mean','radius_se','radius_worst','diagnosis']]
sns.pairplot(radius, hue='diagnosis',palette="husl", markers=["o", "s"],height=4)

We can do another visualization, adding linear regression lines.

In [None]:
texture = cancer_data[['texture_mean','texture_se','texture_worst','diagnosis']]
sns.pairplot(texture, hue='diagnosis', palette="husl",height=4, kind="reg")

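Another visualization, the violin plot, shows the estimated distribution of each feature for each category. The features are standardized first so that they can share one axis. The cell below plots all 30 standardized features at once; narrowing the column slice (for example to `0:10`) plots them in groups of ten variables.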

In [None]:
# y includes our labels and x includes our features
y = cancer_data.diagnosis # M or B 
list_drp = ['id','diagnosis']
x = cancer_data.drop(list_drp, axis = 1 )

In [None]:
import matplotlib.pyplot as plt

# Standardize each feature (zero mean, unit standard deviation) so that
# all features can share one axis
data_n_2 = (x - x.mean()) / x.std()

# Reshape to long format: one row per (case, feature) pair
data = pd.concat([y, data_n_2.iloc[:, 0:30]], axis=1)
data = pd.melt(data, id_vars="diagnosis",
               var_name="features",
               value_name="value")

plt.figure(figsize=(10, 6))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,
               split=True, inner="quart", palette={"B": "g", "M": "r"})
plt.xticks(rotation=90)
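
The two data preparation steps used for this plot, standardization and reshaping to long format, are easier to follow on a very small table. The sketch below uses invented values and hypothetical variable names (`features`, `labels`, `long_table`).

```python
import pandas as pd

# Invented values for two features and four cases
features = pd.DataFrame({
    "radius_mean": [10.0, 12.0, 14.0, 20.0],
    "texture_mean": [15.0, 18.0, 21.0, 30.0],
})
labels = pd.Series(["B", "B", "M", "M"], name="diagnosis")

# Standardization: subtract the column mean, divide by the column
# standard deviation (pandas uses the sample standard deviation)
standardized = (features - features.mean()) / features.std()

# Long format: one row per (case, feature) pair, which is the layout
# seaborn expects for its x / y / hue arguments
long_table = pd.melt(
    pd.concat([labels, standardized], axis=1),
    id_vars="diagnosis",
    var_name="features",
    value_name="value",
)
print(long_table)
```

After standardization every column has mean 0 and standard deviation 1, so features measured in different units become comparable on one axis.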

We can use box plots to represent the "worst" values of the features (the last ten standardized columns).

In [None]:
# box-plots
data = pd.concat([y,data_n_2.iloc[:,20:30]],axis=1)
data = pd.melt(data,id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10,10))
sns.boxplot(x='features', y='value', hue='diagnosis', data=data, palette ={"B": "g", "M": "r"})
plt.xticks(rotation=45)

To explore correlations between the independent variables, we can calculate the correlation matrix and represent it with a heatmap:

In [None]:
# correlation map
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(x.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
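
A large heatmap can be hard to read, so it may help to list the strongly correlated pairs as text. The following is a sketch on invented data in which `perimeter` is constructed to be proportional to `radius`; the threshold of 0.9 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
radius = rng.normal(14, 3, size=200)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,          # proportional to radius
    "texture": rng.normal(19, 4, size=200),   # unrelated noise
})

corr = toy.corr()
cols = corr.columns

# Visit each pair once (upper triangle, diagonal excluded)
strong_pairs = [
    (cols[i], cols[j], round(float(corr.iloc[i, j]), 2))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > 0.9
]
print(strong_pairs)
```

Pairs that appear in such a list carry largely the same information, which is worth knowing before building a model.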

## 6. Build the model

We will build two models: one based on the [K-nearest neighbors (KNN)](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) algorithm, and the other based on [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression).

First, we will define the X (independent) and Y (dependent) variables:

In [None]:
X = cancer_data.iloc[:, 2:32].values
y = cancer_data.iloc[:, 1].values

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
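
To see which number the encoder assigns to each diagnosis, the fitted `LabelEncoder` can be inspected through its `classes_` attribute. This is a small sketch on an invented list of labels.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["M", "B", "B", "M"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order; the position of a
# label in classes_ is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)           # {'B': 0, 'M': 1}
print(encoded.tolist())  # [1, 0, 0, 1]
```

Because the classes are sorted alphabetically, benign (B) becomes 0 and malignant (M) becomes 1, so the "positive" class in the evaluation step is malignant.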

It is important to divide the dataset into two subsets, one for training (creating) the model and the other for testing it. Testing on data the model has not seen helps to check that the model is not overfitted and can be applied to other data.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
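
The cell above fits the scaler on the training subset only and then reuses it on the test subset, so that no information from the test data reaches the model. The sketch below illustrates the effect on invented random data; the names `X_toy`, `y_toy` and `scaler` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Invented data: 100 cases, 3 features
rng = np.random.default_rng(1)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1
)

# Mean and standard deviation are learned from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # reuses the training statistics

print(X_tr.shape, X_te.shape)
print(X_tr_scaled.mean(axis=0).round(3))
print(X_te_scaled.mean(axis=0).round(3))
```

The scaled training columns have mean 0 by construction, while the test columns are only close to 0, because they were scaled with statistics they did not contribute to.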

In this example, we will try two models:
- K-Nearest Neighbor (KNN)
- Logistic Regression

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# KNN
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_predictions = knn.predict(X_test)

# Logistic regression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_predictions = lr.predict(X_test)
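
Both classifiers share the same `fit` / `predict` interface, and logistic regression can additionally report class probabilities. The sketch below runs on synthetic data from `make_classification`, not on the cancer dataset; all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class problem, used only to show the interface
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# n_neighbors=5 is also the value used when no argument is given
knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_demo, y_demo)
lr_demo = LogisticRegression().fit(X_demo, y_demo)

sample = X_demo[:3]
knn_labels = knn_demo.predict(sample)            # one class label per row
lr_probabilities = lr_demo.predict_proba(sample)  # one probability per class

print(knn_labels)
print(lr_probabilities.round(3))
```

Each row of `lr_probabilities` sums to 1; `predict` simply returns the class with the highest probability.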

## 7. Evaluate the model

We can calculate the accuracy of the models, that is, the fraction of correctly classified samples in the test subset.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

print(f'Accuracy scores:\n'
      f'KNN model:\t\t   {accuracy_score(y_test, knn_predictions):.3f}\n'
      f'Logistic regression model: {accuracy_score(y_test, lr_predictions):.3f}')
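
Accuracy can be verified by hand: it is the mean of the element-wise comparison between true and predicted labels. The labels below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels: 8 cases, 6 of them predicted correctly
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# True counts as 1 and False as 0, so the mean is the fraction of matches
manual_accuracy = (y_true == y_pred).mean()

print(manual_accuracy)                   # 0.75
print(accuracy_score(y_true, y_pred))    # 0.75
```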

Another way of evaluating a model is to calculate the confusion matrix *C*, in which *C<sub>i,j</sub>* is the number of observations whose true value is *i* and whose predicted value is *j*.
It gives the numbers of true negatives (*C<sub>0,0</sub>*), false negatives (*C<sub>1,0</sub>*), true positives (*C<sub>1,1</sub>*) and false positives (*C<sub>0,1</sub>*).

In [None]:
# scikit-learn expects the true labels first, then the predictions
matrix = confusion_matrix(y_test, knn_predictions)
sns.heatmap(matrix, cbar=False, annot=True)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix - KNN model')
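
The layout of the confusion matrix is easiest to check on a small example with known counts. The labels below are invented so that the four cells are all different.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels: 4 true negatives cases and 6 true positive cases
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1])

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)
print(matrix)
# [[3 1]
#  [2 4]]

# For a binary problem, ravel() returns the cells in this order
tn, fp, fn, tp = matrix.ravel()
print(tn, fp, fn, tp)   # 3 1 2 4

# Passing the arguments in the other order transposes the matrix,
# which swaps false positives and false negatives
swapped = confusion_matrix(y_pred, y_true)
print(np.array_equal(swapped, matrix.T))   # True
```

In a medical setting the two kinds of error are not equally serious: a false negative here would be a malignant tumor classified as benign, so the argument order matters when reading the plot.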