### GreenDS

# Fundamentals of Agro-Environmental Data Science

### Introduction

The purpose of this Jupyter Notebook exercise is to demonstrate the workflow of a data science project using this IDE.

# Examples of Data Science Use Cases

## Data Science in Healthcare - Predicting Breast Cancer

This example is based on the following [Use Case](https://www.datacamp.com/blog/data-science-use-cases-guide) and Kaggle [example](https://www.kaggle.com/code/vincentlugat/breast-cancer-analysis-and-prediction).

## 1. Prepare your environment:
- Create a `raw-data` directory in your project directory to hold external data files.
- Install and load the Python libraries needed to run the code.
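
The `raw-data` directory can also be created from Python rather than by hand. The following is a minimal sketch that assumes the notebook's working directory is the project directory; the variable name `raw_data_dir` is only illustrative.

```python
from pathlib import Path

# Directory for external data files, relative to the current working directory
raw_data_dir = Path("raw-data")

# exist_ok=True makes the call safe to re-run when the directory already exists
raw_data_dir.mkdir(parents=True, exist_ok=True)

# Show the absolute path so you know where to place the downloaded file
print(raw_data_dir.resolve())
```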

In [None]:
# If pandas and scikit-learn are not installed, you can install them at the
# shell terminal with the following commands
#
# $ pip3 install pandas
# $ pip3 install scikit-learn

# import pandas library
import pandas as pd

## Read data
Download the data from http://www.kaggle.com/uciml/breast-cancer-wisconsin-data and save it as `data.csv` in the `raw-data` directory.

In [None]:
cancer_data = pd.read_csv('./raw-data/data.csv')
# Show every column when displaying the DataFrame (shape[1] is the column count)
pd.options.display.max_columns = cancer_data.shape[1]
print(f'Number of entries: {cancer_data.shape[0]:,}\n'
      f'Number of features: {cancer_data.shape[1]:,}\n\n'
      f'Number of missing values: {cancer_data.isnull().sum().sum()}\n\n'
      f'{cancer_data.head(2)}')

In [None]:
cancer_data

Remove the last column, `Unnamed: 32`, which contains the missing values.

In [None]:
cancer_data = cancer_data.drop('Unnamed: 32', axis=1)

What percentage of women have a confirmed cancer (a malignant breast tumor)?

In [None]:
round(cancer_data['diagnosis'].value_counts()*100/len(cancer_data)).convert_dtypes()

In [None]:
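# Features: the 30 measurement columns that follow `id` and `diagnosis`
X = cancer_data.iloc[:, 2:32].values
# Target: the `diagnosis` column
y = cancer_data.iloc[:, 1].values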

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
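
To see what the encoding step produces, here is a small sketch on toy labels. It assumes the `diagnosis` column holds the strings `B` (benign) and `M` (malignant), as in the Kaggle dataset; the toy values and the names `toy_labels` and `encoder` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels in the same format as the `diagnosis` column (illustrative values)
toy_labels = np.array(["M", "B", "B", "M", "B"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ lists the original labels in sorted order;
# each label's position in that list is its integer code
print(encoder.classes_)  # ['B' 'M']
print(encoded)           # [1 0 0 1 0]

# inverse_transform maps the integer codes back to the original labels
print(encoder.inverse_transform(encoded))
```

Because the labels are sorted alphabetically, `B` becomes 0 and `M` becomes 1, so a prediction of 1 corresponds to a malignant tumor.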

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

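# Hold out 20% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Fit the scaler on the training set only, then apply the same scaling to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)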
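
The scaling step can be illustrated on a tiny made-up array. This sketch uses invented numbers, not the cancer data. It shows that `fit_transform` learns the mean and standard deviation from the training rows, and that `transform` reuses those same statistics for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (illustrative values)
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
toy_train_scaled = scaler.fit_transform(toy_train)  # learns mean/std from training rows
toy_test_scaled = scaler.transform(toy_test)        # reuses the training mean/std

print(scaler.mean_)                   # per-feature mean of the training rows
print(toy_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(toy_train_scaled.std(axis=0))   # approximately 1 for each feature
print(toy_test_scaled)                # test row expressed in training-set units
```

Fitting the scaler on the training rows only keeps information about the test set out of the preprocessing step.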

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# KNN
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_predictions = knn.predict(X_test)

# Logistic regression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_predictions = lr.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

print(f'Accuracy scores:\n'
      f'KNN model:\t\t   {accuracy_score(y_test, knn_predictions):.3f}\n'
      f'Logistic regression model: {accuracy_score(y_test, lr_predictions):.3f}')
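
Accuracy is the fraction of test samples whose predicted class matches the true class. The sketch below checks this on made-up labels (not results from the models above). It also prints a confusion matrix, which is one possible way to see which kind of error a model makes; the original notebook does not include it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels (0 = benign, 1 = malignant), for illustration only
toy_true = np.array([0, 0, 1, 1, 1, 0])
toy_pred = np.array([0, 1, 1, 1, 0, 0])

# accuracy_score equals the share of positions where prediction and truth agree
acc = accuracy_score(toy_true, toy_pred)
manual_acc = (toy_true == toy_pred).mean()
print(f'{acc:.3f}')  # 0.667

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred)
print(cm)
```

In this toy case four of the six predictions are correct, and the confusion matrix shows one error of each kind: one benign sample predicted as malignant and one malignant sample predicted as benign.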