### Data Mining Summer 2022 Lab:
#### Understanding and predicting Shark Presence in Near Shore Waters
#### Your Name:<br><br>
<p> This lab has one deliverable:<br>
Deliverable 1:  Domain Understanding, Data Exploration and Preparation, Decision Trees and Random Forests</p>
#### Deliverable 1
<p>Follow the steps below, which represent the Cross Industry Standard Process for Data Mining (CRISP-DM).  Each step will be represented in code (run the block of code) and there will be places for you to review a document (<font color=RED>REVIEW:</font>), watch a video (<font color=RED>VIDEO:</font>), answer questions (<font color=RED>QUESTION</font>:) and add code (<font color=RED>CODE:</font>).  You will turn in the completed notebook by the end of the day Monday 8/8.</p>
<p>CRISP-DM Steps</p>
1. Problem Statement and Domain Understanding<br>
2. Input Libraries and Data<br>
3. Exploratory Data Analysis including Baseline for Evaluation<br>
4. Data Preprocessing for Modeling<br>
5. Modeling<br>
6. Evaluation<br>
7. Results and Future Work<br>
8. Citations<br>

### 1.  Problem Statement and Domain Understanding
<p>Problem Statement:  The purpose of this research is to improve the understanding of shark presence in near shore waters off of the coast of North and South Carolina.  Specifically, the research will investigate several years of summertime daily data including dates with documented shark attacks. Features involving weather, water, turtles, crabs and moon phases will be used in modeling to predict shark presence.</p>
<p>Domain Understanding:  Understanding the domain is an important first step of a data mining project.  We are exploring several years of data on shark attacks from the <a href="http://www.sharkattackfile.net/incidentlog.htm">Global Shark Attack File</a>.</p>
In order to understand the domain:<br><br>
<font color=RED>REVIEW:</font>  <a href="https://github.com/AKDDResearch/Shark-Attack/blob/master/SAS%20Shark%20Research%20Presentation%20Final.pdf">Developing a Recommender System for Shark Presence Along East Coast Beaches</a><br>

### 2. Input Libraries and Data

#### 2. A.  Import Libraries
<p>We are importing pandas and numpy for working with data, seaborn and matplotlib for plotting, scikit-learn (sklearn) for preprocessing and modeling, and datetime to work with the date attribute.</p><p>You can simply run this code.</p>


In [None]:
#some code so those pesky warnings from deprecated code won't appear
import warnings
#a filter set inside "with warnings.catch_warnings():" is undone when that block ends,
#so set it at the top level to keep it active for the rest of the notebook
warnings.simplefilter("ignore")
#the rest of the imports
#pandas for working with datasets
import pandas as pd
#numpy for working with arrays
import numpy as np
#seaborn for plotting and styling visualizations
import seaborn as sns
#matplotlib for additional customization
import matplotlib.pyplot as plt
#scikit-learn for preprocessing and modeling
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, r2_score, mean_squared_error
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
#datetime for working with dates
import datetime


<h4>2. B. Input Data, Review and Prepare Attributes</h4><br>
  <p>  NOTE:  This data has had transformations applied for the purpose of education and ease of understanding the process we use to apply data mining to predictive analysis.  Transformations include balancing the data set, discretization according to domain understanding and other methods, merging with other data sets according to date, and imputation or removal of null values by row or column. </p>
<p>Due to these changes, this particular data set should not be used for an actual production system for shark presence or attacks. For further studies, the data should be updated with additional years and rebuilt. It can be used, however, to gain an understanding of the problem in order to continue addressing the matter in a scientific manner.</p>
<p>We won't be using all of the attributes for our modeling, just a few of them. You can, however, use any of the attributes for your visualization.</p>

In [None]:
# encoding is a statement of the kinds of characters used
# this data set includes some special characters
# read the csv file sharkdata.csv into bdf
# you can examine the csv file on the github site for class
bdf = pd.read_csv('https://raw.githubusercontent.com/catawba-data-mining/CIS-3902-Data-Mining/main/sharkdata.csv', encoding="ISO-8859-1")
#let's take a look at the attributes and file size
bdf.info()

In [None]:
#let's take a look at the data
bdf.head()

#### 2. C. Transform certain variables to categories for better analysis <br>
<p>This is scary looking code but you can simply run it!</p>
<p>We are changing lots of attributes from type object (the way pandas imports the non-numeric attributes) to type category.  We are keeping the full dataset in bdf at this point just in case you want to use some of these attributes in visualization!  They will work best as categories after this transformation.</p>
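
To see what the conversion does, here is a minimal sketch on a made-up column (the `toy` frame and its `level` values are hypothetical and are not part of sharkdata.csv). It shows that pandas stores the categories in sorted order and that `.cat.codes` returns 0-based positions in that sorted list.

```python
import pandas as pd

# hypothetical toy column standing in for one discretized attribute
toy = pd.DataFrame({"level": ["low", "high", "medium", "low"]})

# object -> category, the same call used on the bdf columns
toy["level"] = toy["level"].astype("category")

# the categories are stored in sorted order
categories = toy["level"].cat.categories.tolist()
# each code is the 0-based position of the row's value in that sorted list
codes = toy["level"].cat.codes.tolist()

print(categories)
print(codes)
```

Because the codes follow sorted order rather than any natural order of the labels, it is worth checking `.cat.categories` before reading meaning into a code.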

In [None]:
#change object type attributes - most of the discretized features - to categorical
#object type can be difficult to visualize and model
bdf["turtleexactdiscretizeSC"] = bdf["turtleexactdiscretizeSC"].astype('category')
bdf["TurtleexactdiscretizeNC"] = bdf["TurtleexactdiscretizeNC"].astype('category')
bdf["TurtleAttackActivityDiscretized"] = bdf["TurtleAttackActivityDiscretized"].astype('category')
bdf["Area"] = bdf["Area"].astype('category')
bdf["Attack"] = bdf["Attack"].astype('category')
bdf["Timeofattack"] = bdf["Timeofattack"].astype('category')
bdf["Beach"] = bdf["Beach"].astype('category')
bdf["DissolvedO2discretize"] = bdf["DissolvedO2discretize"].astype('category')
bdf["salinitydiscretize"] = bdf["salinitydiscretize"].astype('category')
bdf["turbiditydiscretize"] = bdf["turbiditydiscretize"].astype('category')
bdf["temperaturediscretize"] = bdf["temperaturediscretize"].astype('category')
bdf["precipitationdiscretize"] = bdf["precipitationdiscretize"].astype('category')
bdf["pressurediscretize"] = bdf["pressurediscretize"].astype('category')
bdf["windspeeddiscretize"] = bdf["windspeeddiscretize"].astype('category')
bdf["precipitationmvadiscretize"] = bdf["precipitationmvadiscretize"].astype('category')
bdf["CrabLandingsDisc"] = bdf["CrabLandingsDisc"].astype('category')
bdf["Direction"] = bdf["Direction"].astype('category')
bdf["DirectionDisc"] = bdf["DirectionDisc"].astype('category')
bdf["DirectionDiscInt"] = bdf["DirectionDiscInt"].astype('category')
bdf["MoonPhaseCat"] = bdf["MoonPhase"].astype('category')
bdf["MoonPhaseCatExtend"] = bdf["MoonPhaseIntExtend"].astype('category')
#change attack and moonphase cat to codes to help with scatter matrix visualization
#MoonPhaseCat is the actual MoonPhase as a string
#MoonPhaseCatExtended is the Extended MoonPhase
#0 is Quarter moons, 1 is wan gibb and wax cres, 2 is wax gibb and wan cres, 3 is Full and New
#DirectionDiscInt is the Wind Direction discretized
#NE = 1, E = 2, SE = 3, S = 4, W = 5, SW = 6
bdf["AttackCat"] = bdf["Attack"].cat.codes
bdf["MoonPhaseCatExtendCodes"] = bdf["MoonPhaseCatExtend"].cat.codes
bdf["DirectionDiscIntCodes"] = bdf["DirectionDiscInt"].cat.codes
#fix date time
bdf["Date"] = bdf["Date"].astype('category')
format_str = '%d/%m/%Y' # a candidate date format, kept for reference; the conversion below does not use it
bdf["Date"] = bdf["Date"].apply(pd.to_datetime)
#datetime.datetime.strptime(bdf["Date"], format_str)
#print info again on data frame and attributes
bdf.info()

In [None]:
#just run this code too
# are you curious about what we have now?
# let's take a look at the first few rows
bdf.head()

<font color=RED>QUESTION:  </font>Notice that many attributes are represented by the raw value and a discretized value.  For example, turbidity is represented in "Turbidity" and "turbiditydiscretize".  What is the purpose of a discretized attribute like "turtleexactdiscretizeSC" which is built from "TurtleExactCountSC"? Place your answer in this markdown block.<br>

#### 2. D. Create "df" dataframe from "bdf" to only include the attributes needed for modeling.
<p>Use "bdf" for visualizations, etc. as it includes lots of data including discretized, categorical variables and date.  Use "df" for the attributes we are using for decision trees, random forest and Knn modeling.</p><p>Note:  This is a way to change the data for different types of analysis.  You can build df to include the features you want. Notice how we do not include two versions of the same measurement, such as "Salinity" and "salinitydiscretize", in the same df dataset for modeling. We will use the numeric data in its original form for modeling.</p>
<p>You can simply run this code but do pay attention to the attributes we will use for modeling.</p>

In [None]:
#df will include numeric attributes and attack as the target attribute
#for supervised learning
#we are leaving the turtles and crabs out for now, also temperature (it's always hot in summer) and more!
df = bdf[["AttackCat", "MoonPhaseCatExtendCodes", "StationPressure",
          "WindSpeed", "Salinity", "Turbidity", 
          "DissolvedO2", "DirectionDiscIntCodes"]]
#take a look
df.info()

In [None]:
# examine first 15 records
df.head(15)

### 3. Exploratory Data Analysis

#### 3. A. Establish a baseline measure for evaluation

Establish a baseline: we want to do better than the baseline; otherwise, why not just keep the baseline for your predictions?  A baseline is typically an easily calculated value, often a simple average or a measure that is already in use to make predictions.  Our baseline is the share of the most common class in the data, which here is "no attack".  With this data set, if you always pick "No" for whether a shark will be in near shore waters, you will be accurate 62% of the time.

In [None]:
#we are going to calculate a simple percentage of attack = no or 0
#remember this data set has been balanced due to shark attack
#being a rare event
df["AttackCat"].value_counts()

In [None]:
randomAcc = df["AttackCat"].value_counts().max() / df["AttackCat"].value_counts().sum()
randomAcc = round(randomAcc * 100, 2)
print("Accuracy: {randomAcc}%".format(randomAcc = randomAcc))

Our predictions need to be better than the baseline measure of accuracy for predicting whether or not a shark will be in near shore water.
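
The same kind of baseline can be sketched with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The labels below are hypothetical (62 zeros and 38 ones, chosen only to echo the 62% figure); they are not the lab data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# hypothetical labels: 62 days with no attack (0) and 38 days with an attack (1)
y_toy = np.array([0] * 62 + [1] * 38)
# the "most_frequent" strategy ignores the features, so a column of zeros is enough
X_toy = np.zeros((len(y_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

# accuracy of always predicting the majority class
baseline_acc = baseline.score(X_toy, y_toy)
print("Accuracy: {acc}%".format(acc=round(baseline_acc * 100, 2)))
```

Treating the baseline as a model makes it easy to score it with the same metrics as the real classifiers.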

#### 3. B.  Exploratory Data Analysis

Describing a data frame with df.describe() shows statistics for numeric variables including quartiles.

In [None]:
df.describe().transpose()

#### 3. C. Scatter plot matrix:  This powerful visualization can help answer the questions listed below the plot.<br>
<p>Run the code and examine the results - they will appear after the long display of information on an array used in construction of the scatter matrix - you can disregard this.</p>


In [None]:
#we are going to set up some colors for attack = 0 (no attack) or 1 (attack)
attack_colors = {0:'blue', 1:'red'}
pd.plotting.scatter_matrix(df.loc[:,"MoonPhaseCatExtendCodes":"DirectionDiscIntCodes"],figsize=(30,30),grid=True,
                           marker='o', c= df['AttackCat'].map(attack_colors))


<font color=RED>QUESTION:  </font> <p>Are there any scatter plots that look interesting?  </p>
<p>Look for:</p>
<p>
• Are there any pair-wise relationships between different variables? And if there are relationships, what is the nature of these relationships?<br>
• Are there any outliers in the dataset?<br>
• Is there any evidence of clustering by groups present in the dataset on the basis of a particular variable?</p>

<font color=RED>CODE:  </font>Add four interesting visualizations of your choice. Include markdown describing what you have learned about the data from your visualizations. You can use the df or bdf dataset at this point.<br><br>Here is a resource with examples that may give you some ideas for visualization:  <a href="https://elitedatascience.com/python-seaborn-tutorial#:~:text=The%20Ultimate%20Python%20Seaborn%20Tutorial%3A%20Gotta%20Catch%20%E2%80%98Em,it%20all%20together.%20...%2010%20Pok%C3%A9dex%20%28mini-gallery%29.%20">Using Seaborn for Visualization</a><br>

### Markdown for Visualization 1: (explain what you learn from the visualization)

In [None]:
# Code for Visualization 1

### Markdown for Visualization 2: (explain what you learn from the visualization)

In [None]:
# Code for Visualization 2

### Markdown for Visualization 3: (explain what you learn from the visualization)

In [None]:
# Code for Visualization 3

### Markdown for Visualization 4: (explain what you learn from the visualization)

In [None]:
# Code for Visualization 4

### 4.  Data Preprocessing for Modeling
<p>Our first example will explore a combination of categorical and numeric features in the dataset.  The numeric features have not yet been scaled; scaling handles the different ranges of values present in the continuous variables.  We will be using a powerful feature of scikit-learn, the StandardScaler, to scale the variables.  The StandardScaler is a z-score scaler: 0 is the mean, and -1 and +1 are one standard deviation below and above the mean.  The more negative the value, the further below the mean; the more positive the value, the further above the mean.  We will then explore three models:  Knn, Decision Trees and Random Forest and compare the accuracy of the models. We will also use train and test data, holding out 33% of the rows for testing.</p>
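
As a small worked example of z-score scaling, the sketch below scales three made-up values (they are not taken from the lab data). Each value becomes (value - mean) / standard deviation, so the scaled column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical single feature with three observations
values = np.array([[10.0], [20.0], [30.0]])

scaler_toy = StandardScaler()
z = scaler_toy.fit_transform(values)

# the same result by hand: subtract the mean, divide by the (population) standard deviation
z_by_hand = (values - values.mean()) / values.std()

print(z.ravel())
print(z_by_hand.ravel())
```

The middle value sits exactly at the mean, so it scales to 0; the other two land the same distance on either side.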

#### 4. A. Setting up the models<br><p>You can simply run this code; we are getting some variables set up so we can run our three machine learning models. </p>

In [None]:
# setting y to the target variable attack yes or 1, no or 0
y = df["AttackCat"]
# dropping the target variable from X
# you can also drop other variables
X = df.drop(["AttackCat"], axis=1)

# setting the parameters for the models
knn = KNeighborsClassifier(n_neighbors=3)
rfc = RandomForestClassifier()
dt = DecisionTreeClassifier(random_state=0)

clNames = ["KNN", "Random Forests", "Decision Trees"]
classifiers = [knn,rfc,dt]
classifierScores = []

#### 4. B. Numeric Feature Scaling with the Standard Scaler (can improve accuracy)<br><p>You can simply run this code - we are scaling the numeric variables around the mean (z scores) so their ranges will be comparable.  We are excluding the Moon Phases and Wind Direction because we want their actual codes, not a z-score representation.</p>

In [None]:
# set scaler to StandardScaler
scaler = StandardScaler()
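# save the two coded features so they can be added back at the end
name_var = X['MoonPhaseCatExtendCodes']
name_var2 = X['DirectionDiscIntCodes']
# get numeric data in num_X, leaving the two coded features out of the scaling
# (drop returns a new DataFrame, so its result has to be used; X itself is unchanged)
num_X = X.drop(columns=['MoonPhaseCatExtendCodes','DirectionDiscIntCodes']).select_dtypes(exclude=['category'])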

# update the cols with their normalized or scaled values
X[num_X.columns] = scaler.fit_transform(num_X)

In [None]:
#add MoonPhaseCatExtendCodes and DirectionDiscIntCodes back and then let's take a look at X
X['MoonPhaseCatExtendCodes']=name_var
X['DirectionDiscIntCodes']=name_var2
X.head()

#### 4.C. For Knn, Changing Categorical Features to Numbers
<p>We have to convert categorical features to a numerical value for Knn.  In order to encode this data, you could map each value to a number, e.g. Overcast:0, Rainy:1, and Sunny:2.</p>
<p>
This process is known as label encoding, and sklearn conveniently will do this for you using LabelEncoder.</p>
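
The weather example above can be sketched as follows. The `weather` list is made up for illustration; `LabelEncoder` assigns integers in sorted order of the labels, which is why Overcast gets 0, Rainy gets 1 and Sunny gets 2.

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical categorical feature
weather = ["Sunny", "Overcast", "Rainy", "Sunny"]

le_toy = LabelEncoder()
encoded = le_toy.fit_transform(weather).tolist()

# classes_ lists the labels in the order of their integer codes
classes = le_toy.classes_.tolist()

print(classes)
print(encoded)
```

Keep in mind that the integers imply an order (0 < 1 < 2) that may or may not be meaningful for a distance-based model such as Knn.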

#### A refresher on moon phases and Wind Direction:  back to domain knowledge! (CRISP-DM is a cyclical process)<br>
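<p>At Full and New Moon, the moon's effect on tides is the greatest, producing the largest tidal range (the highest high tides and the lowest low tides).  In this lab's coding, "Neap Tides" labels the phases when the tides are decreasing after full and new moon, and "Spring Tides" labels the phases when the tides are rising toward full and new moon.</p>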
<p>A feature that is already encoded is Moon Phase Cat Extend Codes:  0 - quarter moons, 1 - Neap Tides, 2 - Spring Tides, 3 - Full and New Moon</p>
<p>Another feature is Wind Direction: 1 - NE, 2 - E, 3 - SE, 4 - S, 5 - W, 6 - SW.  Surf fishers say the fishing is best with south and westerly winds.</p>
<p>From <a href="https://sciencing.com/effects-moon-phases-ocean-tides-8435550.html">Sciencing.Com Moon Phases and Ocean</a></p>
<p>You can just run this code; scikit-learn does all the work for you!</p>

In [None]:
# Label Encoding with scikit-learn
from sklearn import preprocessing
# creating le LabelEncoder
le = preprocessing.LabelEncoder()
# transforming just the categorical and int or non scaled attributes (do not do this on the numeric attributes)
X['MoonPhaseCatExtendCodes'] = le.fit_transform(X['MoonPhaseCatExtendCodes'])
X['DirectionDiscIntCodes'] = le.fit_transform(X['DirectionDiscIntCodes'])
X.head()

#### 4. D. Create the Train-Test-Split<br><p>You can run this code - we have learned about train-test-split for building and evaluating models with our previous class work.</p>
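
A split of this kind picks different rows every time it runs, so the accuracies will change from run to run. The sketch below, on made-up arrays, shows two optional arguments of `train_test_split`: `random_state` fixes the shuffle so results can be repeated, and `stratify` keeps the class proportions similar in the train and test sets. The lab's own cell does not set either one; this is only an illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical data: 100 rows, 2 features, 60 zeros and 40 ones as labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.33,     # hold out 33% of the rows for testing
    random_state=0,     # fixed seed so the same rows are chosen on every run
    stratify=y_toy,     # keep roughly 60/40 in both the train and test sets
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))
```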

In [None]:
# convert string variables to One Hot Encoding if needed for certain models
# such as Association Rules (not in this Deliverable)
# X = pd.get_dummies(X)

# build train test split for modeling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

### 5.  Modeling
<p>We will explore three models:  Knn, Decision Trees and Random Forest and compare the accuracy of the models. We will be using supervised machine learning with the train-test-split data.</p>

#### 5.  A. Knn or K Nearest Neighbors
<p>KNN requires scaling of data because KNN uses the Euclidean distance between two data points to find nearest neighbors. Euclidean distance is sensitive to magnitudes: features with high magnitudes will weigh more than features with low magnitudes. KNN is also not well suited to high-dimensional data. We need to convert the categorical data to numeric values before using KNN.</p>
<p><font color=RED>VIDEO: </font>Watch the video to learn more about Knn for predictive models.</p>
<p><a href="https://youtu.be/4HKqjENq9OU">Understanding Knn</a></p>
<p><a href="https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn#:~:text=KNN%20requires%20scaling%20of%20data%20because%20KNN%20uses,KNN%20also%20not%20suitable%20for%20large%20dimensional%20data">Datacamp:  KNN Tutorial (short)</a></p>
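
To make the point about magnitudes concrete, here is a minimal sketch with two made-up points. The feature values and the standard deviations used for scaling are assumptions chosen for easy arithmetic; they are not measurements from the lab data.

```python
import numpy as np

# hypothetical points: [pressure-like feature, salinity-like feature]
a = np.array([1010.0, 30.0])
b = np.array([1000.0, 35.0])

# raw Euclidean distance: the first feature contributes 10^2 = 100, the second 5^2 = 25
raw_dist = np.linalg.norm(a - b)

# assumed standard deviations for the two features
stds = np.array([10.0, 2.5])

# after dividing by the standard deviations the differences are 1 and 2,
# so the second feature now contributes more to the distance
scaled_dist = np.linalg.norm((a - b) / stds)

print(round(raw_dist, 3), round(scaled_dist, 3))
```

Before scaling, the large-magnitude feature dominates the distance; after scaling, each feature counts in units of its own spread.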

#### 5. B. Knn Modeling<br>
<p>Run the four code blocks below for KNN Modeling - the last shows the accuracy.  Remember our baseline!</p>


In [None]:
# We will start with k nearest neighbors = 3
# build the knn model with the training data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

In [None]:
# Prediction, print confusion matrix for the test data
kPred = knn.predict(X_test)
print(confusion_matrix(y_test, kPred))

In [None]:
# print classification report
print(classification_report(y_test,kPred))

In [None]:
# print accuracy
knnAcc = accuracy_score(y_test,kPred) * 100
knnAcc = round(knnAcc, 2)
print("Accuracy: {knnAcc}%".format(knnAcc=knnAcc))

#### 5. C. Random Forest<br>
<p>Run the four code blocks below for Random Forest Modeling - the last shows the accuracy. See if we are improving over the Knn model for prediction!</p>


In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

In [None]:
rfcPred = rfc.predict(X_test)
print(confusion_matrix(y_test, rfcPred))

In [None]:
print(classification_report(y_test,rfcPred))

In [None]:
rfAcc = accuracy_score(y_test,rfcPred) * 100
rfAcc = round(rfAcc, 2)
print("Accuracy: {rfAcc}%".format(rfAcc=rfAcc))

#### 5. D. Decision Trees<br>
<p>Run the three code blocks below for Decision Tree Modeling - the second shows the accuracy. See if we are improving over the Knn and Random Forest models for prediction! The last code block visualizes the tree.</p>


In [None]:
from sklearn import tree
dt = tree.DecisionTreeClassifier(random_state=0, criterion='entropy')
dt.fit(X_train, y_train)

In [None]:
dtPred = dt.predict(X_test)

dtAcc = accuracy_score(y_test,dtPred) * 100
dtAcc = round(dtAcc, 2)
print("Accuracy: {dtAcc}%".format(dtAcc=dtAcc))

In [None]:
# let's visualize the decision tree
dt_feature_names = list(X.columns)
dt_target_names = [str(s) for s in dt.classes_]  # plot_tree expects class names in ascending class order
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(dt, 
                   feature_names=dt_feature_names,  
                   class_names=dt_target_names,
                   filled=True)

<font color=RED>QUESTION:</font><p>Some researchers are studying the effect of wind on shark presence in near shore waters.  This is a relatively new area of research. Take a look at what they are saying: <a href="https://abc7.com/shark-attack-attacks-sharks-tagging/5388879/#:~:text=They%20said%20sea%20breeze%20at%20the%20sites%20of,white%20and%20others%20head%20up%20the%20Eastern%20Seaboard">Does the sea breeze make shark attacks more likely?</a></p>
<p>Another resource:  <a href="https://www.yahoo.com/gma/meet-2-scientists-trying-forecast-104935603.html?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuYmluZy5jb20vc2VhcmNoP3E9RHIuK0dyZWcrU2tvbWFsK2FuZCtOYXRpb25hbCtXZWF0aGVyK1NlcnZpY2UrbWV0ZW9yb2xvZ2lzdCtKb2UrTWVyY2hhbnQmc3JjPUlFLVNlYXJjaEJveCZGT1JNPUlFU1I0Tg&guce_referrer_sig=AQAAAJjrmPNw7ZiCXSsVFmG7PYyxb3WaHU0MFNzSPhE-60Xgzc6-V-mHo4HRP1yMGkKKvnVHAxhwj7ZL-3hd9JDcKvdlsUf9k0oIEp5xEpKEGxyo8CEOPpF-ADJkan8ajzUIjj8f-lF8srjWtrGA-fP7HeLGZpZ98RabEdokIuerouIJ">Sea Breeze and Shark Attacks</a></p>
<p>Based on our work, what do you think?  Does wind play a role in shark presence, even attacks, in near shore waters off of the coast of North and South Carolina (our data)?</p>

<font color=RED>QUESTION: </font> <p>What other variables seem to play a role?  Going by the decision tree or your earlier visualizations, research features or variables of interest and document your findings.</p>
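
One way to look for influential variables, besides reading the plotted tree, is the `feature_importances_` attribute of a fitted tree. The sketch below uses synthetic data with hypothetical feature names, built so that the label depends only on the first feature; it illustrates how to read the attribute and says nothing about the shark data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# synthetic data: 200 rows, 3 features drawn at random
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
# the label depends only on the first feature
y_toy = (X_toy[:, 0] > 0).astype(int)

tree_toy = DecisionTreeClassifier(random_state=0, criterion="entropy")
tree_toy.fit(X_toy, y_toy)

# hypothetical names for the three columns
names = ["feature_a", "feature_b", "feature_c"]
importances = dict(zip(names, tree_toy.feature_importances_))

# print the features from most to least important
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(value, 3))
```

The importances sum to 1, and a feature the tree never splits on gets 0. On real data the values are usually spread across several features and can shift from one train-test split to another.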

### 6. Evaluation<br>
<p>Let's compare the models.</p>
<p>Then try a prediction to see if the decision tree model predicts 1 for shark presence, or 0 for no shark presence!</p><p>Run the first code block and second code block.  You can change the variables for the second code block. Remember that the numeric variables are scaled, so 0 is close to the mean, negative numbers are below the mean and positive numbers are above the mean. The units are standard deviations, so -3 and +3 are extremely low and high.</p>
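
If you would rather start from a raw measurement than guess a z-score, a fitted scaler can do the conversion. The sketch below fits a separate scaler on made-up readings (hypothetical values, not the lab data) and converts one new reading into standard deviation units, then converts a z-score of 0 back to raw units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical raw readings used to fit the scaler (mean is 32)
raw_train = np.array([[28.0], [30.0], [32.0], [34.0], [36.0]])
scaler_toy = StandardScaler().fit(raw_train)

# transform reuses the mean and standard deviation learned from raw_train
new_reading = np.array([[36.0]])
z_new = scaler_toy.transform(new_reading)[0, 0]

# inverse_transform goes the other way: a z-score of 0 maps back to the mean
raw_at_zero = scaler_toy.inverse_transform(np.array([[0.0]]))[0, 0]

print(round(z_new, 3), raw_at_zero)
```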

In [None]:
classifiers = ["KNN", "Random Forests", "Decision Trees"]
accuracies = [knnAcc, rfAcc, dtAcc]
x = np.arange(len(classifiers))
ytickLabels = ["0%","10%","20%","30%","40%","50%","60%","70%","80%","90%","100%"]
yticks = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
plt.figure(figsize=(10, 15))
plt.bar(x, accuracies, align='center', alpha=0.5)
plt.xticks(x, classifiers)
plt.yticks(yticks, ytickLabels)
plt.ylabel('Accuracy')
plt.title('Classifier Accuracy')
plt.show()

In [None]:
#let's try a prediction
#we will input a sample MoonPhaseCatExtendCodes, StationPressure
#WindSpeed, Salinity, Turbidity, DissolvedO2, DirectionDiscIntCodes
# you can change the values of the variables
moon = 3
stationpressure = 1.0
windspeed = 2
salinity = 1
turbidity = 0
dissolved02 = 0
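winddirection = 8  # the direction codes documented above run from 1 (NE) to 6 (SW); 8 is outside that range
# dt.predict uses the decision tree model that has been built previously
# putting the values in a DataFrame with the training column names avoids a feature-name warning
sample = pd.DataFrame([[moon,stationpressure,windspeed,salinity,turbidity,dissolved02,winddirection]], columns=X.columns)
prediction = dt.predict(sample)
# let's see what the prediction is - 1 is yes to shark presence, 0 is no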
print("The prediction is ", prediction)