Notes by - Kiran A Bendigeri
Please read the 'Read me' file.

Python

'''Machine learning is about extracting knowledge from data. It is a research field at the
intersection of statistics, artificial intelligence, and computer science and is also
known as predictive analytics or statistical learning.

Python has become the lingua franca for many data science applications. It combines
the power of general-purpose programming languages with the ease of use of
domain-specific scripting languages like MATLAB or R. Python has libraries for data
loading, visualization, statistics, natural language processing, image processing, and
more. This vast toolbox provides data scientists with a large array of general- and
special-purpose functionality. One of the main advantages of using Python is the
ability to interact directly with the code, using a terminal or other tools like the Jupyter
Notebook, which we’ll look at shortly. Machine learning and data analysis are
fundamentally iterative processes, in which the data drives the analysis. It is essential for
these processes to have tools that allow quick iteration and easy interaction.

scikit-learn is a very popular tool, and the most prominent Python library for
machine learning. It is widely used in industry and academia, and a wealth of
tutorials and code snippets are available online. scikit-learn works well with a number of
other scientific Python tools, including NumPy, SciPy, matplotlib, and pandas.

NumPy is one of the fundamental packages for scientific computing in Python. It
contains functionality for multidimensional arrays, high-level mathematical
functions such as linear algebra operations and the Fourier transform, and pseudorandom
number generators.
'''
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))
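'''
The description of NumPy above mentions linear algebra, the Fourier transform, and
pseudorandom number generators. The following is a minimal sketch of each; the array
values and the seed are arbitrary choices made for illustration.
'''
```python
import numpy as np

# A small 2x3 array, the same values as in the example above
a = np.array([[1, 2, 3], [4, 5, 6]])
print("Shape: {}".format(a.shape))

# Linear algebra: matrix product of a (2x3) with its transpose (3x2)
gram = a @ a.T
print("a @ a.T:\n{}".format(gram))

# Fourier transform of a constant signal: only the zero-frequency bin is nonzero
spectrum = np.fft.fft(np.ones(4))
print("FFT of ones: {}".format(spectrum))

# Pseudorandom numbers: a seeded generator gives reproducible draws
rng = np.random.default_rng(0)
sample = rng.normal(size=3)
print("Random sample: {}".format(sample))
```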

'''
SciPy is a collection of functions for scientific computing in Python. It provides,
among other functionality, advanced linear algebra routines, mathematical function
optimization, signal processing, special mathematical functions, and statistical
distributions.'''
from scipy import sparse
# Create a 2D NumPy array with a diagonal of ones, and zeros everywhere else
eye = np.eye(4)
print("NumPy array:\n{}".format(eye))

# Convert the NumPy array to a SciPy sparse matrix in CSR format
# Only the nonzero entries are stored
sparse_matrix = sparse.csr_matrix(eye)
print("\nSciPy sparse CSR matrix:\n{}".format(sparse_matrix))
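'''
When a dense array would be too large to hold in memory, a sparse matrix can also be
built directly from its nonzero entries. The following sketch uses the COO (coordinate)
format to build the same 4x4 identity matrix; the variable names are illustrative.
'''
```python
import numpy as np
from scipy import sparse

# Values of the nonzero entries and their (row, column) positions
values = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)

# Build the matrix in COO format without creating a dense array first
eye_coo = sparse.coo_matrix((values, (row_indices, col_indices)))
print("COO representation:\n{}".format(eye_coo))

# Converting back to a dense array confirms it equals np.eye(4)
same = np.array_equal(eye_coo.toarray(), np.eye(4))
print("Matches np.eye(4): {}".format(same))
```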

'''
matplotlib is the primary scientific plotting library in Python. It provides functions
for making publication-quality visualizations such as line charts, histograms, scatter
plots, and so on. 
'''
import matplotlib.pyplot as plt
# Generate a sequence of numbers from -10 to 10 with 100 steps in between
x = np.linspace(-10, 10, 100)
# Create a second array using sine
y = np.sin(x)
# The plot function makes a line chart of one array against another
plt.plot(x, y, marker="x")
plt.show()

'''
pandas is a Python library for data wrangling and analysis. It is built around a data
structure called the DataFrame, a table similar to a spreadsheet.
'''
import pandas as pd
# Create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location': ["New York", "Paris", "Berlin", "London"],
        'Age': [24, 13, 53, 33]}
data_pandas = pd.DataFrame(data)
# In a Jupyter notebook, display(data_pandas) from IPython.display would
# "pretty print" the DataFrame; print works in any environment
print(data_pandas)
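'''
A DataFrame can be queried much like a table. The sketch below selects all rows whose
Age column is greater than 30. It rebuilds the same small dataset so that it runs on
its own; the names people and older_than_30 are illustrative.
'''
```python
import pandas as pd

people = pd.DataFrame({
    'Name': ["John", "Anna", "Peter", "Linda"],
    'Location': ["New York", "Paris", "Berlin", "London"],
    'Age': [24, 13, 53, 33],
})

# Boolean mask: True for the rows where Age is greater than 30
older_than_30 = people[people['Age'] > 30]
print(older_than_30)
```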

'''Versions of Python and of the libraries used in these notes'''
import sys
print("Python version: {}".format(sys.version))
import pandas as pd
print("pandas version: {}".format(pd.__version__))
import matplotlib
print("matplotlib version: {}".format(matplotlib.__version__))
import numpy as np
print("NumPy version: {}".format(np.__version__))
import scipy as sp
print("SciPy version: {}".format(sp.__version__))
import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))

'''
A First Application: Classifying Iris Species
Because we have measurements for which we know the correct species of iris, this is a
supervised learning problem. In this problem, we want to predict one of several
options (the species of iris). This is an example of a classification problem. The
possible outputs (different species of irises) are called classes. Every iris in the dataset
belongs to one of three classes, so this problem is a three-class classification problem.
The desired output for a single data point (an iris) is the species of this flower. For a
particular data point, the species it belongs to is called its label.

The data we will use for this example is the Iris dataset, a classical dataset in machine
learning and statistics. It is included in scikit-learn in the datasets module. We
can load it by calling the load_iris function:
'''
# Load the data
from sklearn.datasets import load_iris
iris_dataset = load_iris()
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))
print(iris_dataset['DESCR'][:193] + "\n...")
print("Target names: {}".format(iris_dataset['target_names']))
print("Feature names: \n{}".format(iris_dataset['feature_names']))
print("Type of data: {}".format(type(iris_dataset['data'])))
print("Shape of data: {}".format(iris_dataset['data'].shape))
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))
print("Type of target: {}".format(type(iris_dataset['target'])))
print("Shape of target: {}".format(iris_dataset['target'].shape))
print("Target:\n{}".format(iris_dataset['target']))
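'''
The target array encodes each species as an integer from 0 to 2, and these integers
index into target_names. The following sketch counts the samples per class with
np.bincount; the variable names are illustrative.
'''
```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# target holds the integers 0, 1, 2; bincount tallies how often each one occurs
counts = np.bincount(iris['target'])
for name, count in zip(iris['target_names'], counts):
    print("{}: {} samples".format(name, count))
```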
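'''In scikit-learn, data is usually denoted with a capital X, while labels are denoted by
a lowercase y. The train_test_split function shuffles the dataset and splits it; by
default, 75% of the rows go to the training set and the remaining 25% to the test set.'''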
from sklearn.model_selection import train_test_split
#import mglearn
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))
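'''
To see what train_test_split does, it can help to run it on a toy dataset where every
row is easy to recognize. In the sketch below, the toy arrays are made up for
illustration, and test_size=0.25 is written out explicitly although it matches the
default.
'''
```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 toy samples: row i is [2*i, 2*i + 1] and its label is i
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)
print("Train shape: {}, test shape: {}".format(X_tr.shape, X_te.shape))

# The rows are shuffled, but every row stays paired with its own label
print("Rows still match labels: {}".format(np.array_equal(X_tr[:, 0], 2 * y_tr)))
```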
# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# Create a scatter matrix from the dataframe, colored by y_train
# (left commented out because the colormap mglearn.cm3 requires the mglearn package)
#grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
#                        hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)
'''The most important parameter of KNeighborsClassifier is the number of neighbors,
which we will set to 1:'''
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
'''To build the model on the training set, we call the fit method of the knn object'''
knn.fit(X_train, y_train)
'''We can now make predictions using this model on new data for which we might not
know the correct labels. Imagine we found an iris in the wild with a sepal length of
5 cm, a sepal width of 2.9 cm, a petal length of 1 cm, and a petal width of 0.2 cm.'''
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape: {}".format(X_new.shape))
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
        iris_dataset['target_names'][prediction]))
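'''
With n_neighbors=1, the prediction is simply the label of the closest training point.
The sketch below spells this out with NumPy, using Euclidean distance (the default
metric of KNeighborsClassifier). The helper predict_1nn is hypothetical and written
only for illustration; it is not part of scikit-learn.
'''
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], random_state=0)

def predict_1nn(X_train, y_train, X_query):
    """Hypothetical helper: return the label of the nearest training point."""
    # Pairwise differences, shape (n_query, n_train, n_features)
    diffs = X_query[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Euclidean distance from every query point to every training point
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Index of the closest training point for each query point
    nearest = distances.argmin(axis=1)
    return y_train[nearest]

X_new = np.array([[5, 2.9, 1, 0.2]])
manual = predict_1nn(X_train, y_train, X_new)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("Manual 1-NN prediction: {}".format(iris['target_names'][manual]))
print("scikit-learn prediction: {}".format(iris['target_names'][knn.predict(X_new)]))
```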
'''
Evaluating the model
We can make a prediction for each iris in the test data and compare it
against its label (the known species). We can measure how well the model works by
computing the accuracy, which is the fraction of flowers for which the right species
was predicted:'''
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))
# Express the accuracy as a percentage, rounded to one decimal place
acc = np.mean(y_pred == y_test) * 100
print("The test set accuracy is about {:.1f}%".format(acc))
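'''
The same accuracy can also be obtained from the score method of the classifier, which
predicts on the test data and compares the result with the labels in one call. The
sketch below rebuilds the model so that it runs on its own and checks that both ways
of computing the accuracy agree.
'''
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Accuracy computed by hand: fraction of test flowers predicted correctly
manual_accuracy = np.mean(knn.predict(X_test) == y_test)
# Accuracy computed by the classifier's own score method
score_accuracy = knn.score(X_test, y_test)

print("Test set score: {:.2f}".format(score_accuracy))
print("Both methods agree: {}".format(np.isclose(manual_accuracy, score_accuracy)))
```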



