# Introduction to Data Science in Scikit-Learn

*Note: Some of the material comes from online sources and may not be properly cited.*

There will be four parts to this introduction. For the most part, everything can be considered basic. However, some advanced concepts are used. In those specific areas, don't worry: they will become familiar as we go through our group work on machine learning.

The notebooks are divided as follows:
- **Basics of python**
- **Numpy and Matplotlib for scientific computing and plotting respectively**
- **Pandas for data exploration and manipulation**
- **Scikit-learn for machine learning**

This folder contains an introduction to ML using Scikit-learn. The series is divided into four 
notebooks, one for each part.

#### Part 1: Basic introduction to python
In this notebook, we will go through the basics of Python, including
- Basic types
- Containers like list, dict, set, etc.
- Functions, as well as built-in helpers like `enumerate` and `range`
- Classes and objects
- Reading and writing to files
    
#### Part 2: Numpy and Matplotlib for scientific computing and plotting respectively
In this notebook, we will go through `numpy` and `matplotlib` library features, which include
 - Creating and manipulating n-dimensional vectors
 - Getting statistics from vectors
 - Basic algebra
 - Vectorized computations
 - Basic 1D and 2D plotting
 - Customizing plots
 - Histogram plots etc
 
 
#### Part 3: Pandas for data exploration and manipulation
In this notebook, we will go through data exploration using pandas. The concepts covered include
 - Creating Series and Dataframes
 - Manipulation of rows and columns
 - Performing basic stats
 - Using special functions to work with the data held
 - Data visualizations with plotting 
 - Measuring correlation with a correlation matrix
 
 
#### Part 4: Scikit-learn for machine learning
In this notebook, we will go through simple machine learning concepts as a way to get a bit familiar with 
Scikit-learn. We will cover
 - Types of machine learning
 - What does the full process look like?
 - Getting data (synthetic and toy data)
 - Training a clustering algorithm on synthetic data
 - Data exploration of the Iris data set
 - Training a KNN algorithm on Iris data
 - Evaluating the performance of a trained model
 - Future perspectives

# Part 1. Basics of python foundations

Python is a general-purpose programming language like C++ and Java, but it is dynamically typed rather than statically typed. This means one doesn't need to declare a variable's type before assigning a value to it. We will use the Python distribution provided by [Anaconda](anaconda.com), as it makes installing and managing Python very easy. In the near future we will create a `virtual` environment to manage packages for some specialized projects. We will use the Jupyter notebook, which is an interactive environment. An excellent beginner intro can be found [here](https://www.dataquest.io/blog/jupyter-notebook-tutorial/).


## 1. Jupyter notebook basics
[Read this blog for a start.](https://www.dataquest.io/blog/jupyter-notebook-tutorial/)
Two important concepts are code cells and Markdown cells:
-    A code cell contains code to be executed in the kernel and displays its output below.
-    A Markdown cell contains text formatted using Markdown and displays its output in-place when it is run

Below, you’ll find a list of some of Jupyter’s keyboard shortcuts. You’re not expected to pick them up immediately, but the list should give you a good idea of what’s possible.

-    Toggle between command and edit mode with Esc and Enter, respectively.
-    Once in command mode:
     -   Scroll up and down your cells with your Up and Down keys.
     -   Press A or B to insert a new cell above or below the active cell.
     -  M will transform the active cell to a Markdown cell.
     -   Y will set the active cell to a code cell.
     -  D + D (D twice) will delete the active cell.
     -   Z will undo cell deletion.
     -   Hold Shift and press Up or Down to select multiple cells at once.
          -  With multiple cells selected, Shift + M will merge your selection.
-    Ctrl + Shift + -, in edit mode, will split the active cell at the cursor.
-    You can also click and Shift + Click in the margin to the left of your cells to select them.


### 2  Basic types & Arithmetic operators
bool, float, int, str

In [None]:
x = True
y = 10; z = 10.19
w = "Welcome to python!, that was today's date!"

In [None]:
print(x, y, z, w)

**Operators** 
`+`, `-`, `*`, `/`, `//`, `**`, `%`

In [None]:
x+y

In [None]:
z+y

In [None]:
# This should raise an error
# w+y # Why? Because we can't add an int to a str

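Operators are overloaded: `+` adds numbers but concatenates strings. Python does not implicitly convert between `str` and `int`; an explicit conversion such as `str(y)` or `int("10")` is required.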
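As a small illustration of how `+` depends on the operand types, here is a minimal sketch. The variable names (`count`, `label`, and so on) are made up for this example and are not part of the notebook.

```python
# Illustrative values; these names are assumptions for this sketch
count = 10
label = "10"

num_sum = count + 5              # numeric addition
text_join = label + "5"          # string concatenation
as_int = int(label) + count      # explicit str -> int conversion
as_text = str(count) + label     # explicit int -> str conversion

try:
    label + count                # mixing str and int raises TypeError
except TypeError as err:
    error_message = str(err)

print(num_sum, text_join, as_int, as_text)
print(error_message)
```

The explicit `int(...)` and `str(...)` calls make the intended meaning of `+` unambiguous.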

### 3. Strings
Basics
 - single quotes, double quotes, and escape characters or triple quotes for strings that span multiple lines
 - strings are immutable objects (any method on a string returns a new object without changing the original)
 - string indexing and slicing
 - different ways of printing a string

Methods, .upper(), .lower(), .replace(), .startswith(),...

In [None]:
w

In [None]:
w[0:18]

In [None]:
# Get the first ten characters
w[:10]

In [None]:
#Get the last char
w[-1]

Exploring string methods

In [None]:
w.upper()

In [None]:
w

In [None]:
w[18:].replace(' ', '- ')
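Two points from the list above, immutability and the different ways of printing a string, can be sketched as follows. The string and the names used here are illustrative assumptions, not values from the notebook.

```python
greeting = "Welcome to python!"   # example string, an assumption for this sketch

shouted = greeting.upper()        # returns a new string
unchanged = greeting              # the original is left as it was

try:
    greeting[0] = "w"             # item assignment is not allowed on str
except TypeError:
    can_assign = False
else:
    can_assign = True

# Three common ways to build the same printed text
name, count = "cart", 3
old_style = "%s has %d items" % (name, count)
format_style = "{} has {} items".format(name, count)
f_string = f"{name} has {count} items"

print(shouted)
print(old_style, format_style, f_string, sep="\n")
```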

##  Data Structures
### 4 List: 
A container that holds an ordered collection of items of assorted types
 - `list` types
 - mutable
 - sorting
 - variable size
 - slicing and indexing
 - `list` methods

In [None]:
list_1 = [x, y, x, w]

In [None]:
dir(list_1)

In [None]:
list_1[0]

In [None]:
list_1[3]

In [None]:
list_2 = []
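The cells above only index into a list. A minimal sketch of the other properties listed (mutability, variable size, sorting, slicing and `list` methods) might look like this; the data is made up for illustration.

```python
numbers = [5, 2, 9]          # example data, an assumption for this sketch

numbers.append(1)            # grow the list in place -> [5, 2, 9, 1]
numbers[0] = 7               # lists are mutable -> [7, 2, 9, 1]
numbers.sort()               # in-place sort -> [1, 2, 7, 9]

first_two = numbers[:2]      # slicing returns a new list
last_item = numbers.pop()    # removes and returns the last element

print(numbers, first_two, last_item)
```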

### 5 Dictionary: 
A container to hold key and values
 - `dict` types
 - key-value pairs
 - keys must be immutable types
 - fast look-up 

In [None]:
dict_1 = {"boolx":x, "inty":y, "floatz":z, "stringw":w}

In [None]:
dict_1

In [None]:
dict_1['inty']
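To illustrate the points about immutable keys and look-up, here is a short sketch. The dictionary contents and names are hypothetical and chosen only for this example.

```python
ages = {"ana": 31, "ben": 25}      # example data, an assumption for this sketch

ages["cy"] = 40                    # add a new key-value pair
missing = ages.get("dee", 0)       # .get returns a default instead of raising KeyError
has_ana = "ana" in ages            # membership tests look at the keys

point_labels = {(0, 0): "origin"}  # a tuple is immutable, so it can be a key

try:
    bad = {[0, 0]: "origin"}       # a list is mutable (unhashable)
except TypeError:
    list_key_allowed = False
else:
    list_key_allowed = True

print(ages, missing, has_ana, list_key_allowed)
```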

### 6 Set: 
A container to hold unique values
 - a `set` stores unique, hashable (immutable) values

In [None]:
set_1 = {w, x, y, z}

In [None]:
# Sets do not support indexing (subscripting); this raises a TypeError
# set_1[0]

In [None]:
set([4,4,1,2,2,2,3,4,5,5,5,6])  # Removes duplicates
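Beyond removing duplicates, sets support the usual set algebra. A brief sketch with made-up values:

```python
evens = {0, 2, 4, 6}          # example sets, assumptions for this sketch
small = {0, 1, 2, 3}

both = evens & small          # intersection
either = evens | small        # union
only_even = evens - small     # difference

# Duplicates are removed by set(); sorted() then returns an ordered list
unique_sorted = sorted(set([4, 4, 1, 2, 2, 3]))

print(both, either, only_even, unique_sorted)
```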

### 7 Tuples: 
A container that is similar to a list but is frozen, i.e. of fixed size
 - `tuples` are immutable, but the values they hold can be mutable
 - comma-separated values create a tuple; the parentheses are optional

In [None]:
tup_1 = x, y, z, w  # Same result as tup_1 = (x, y, z, w)

In [None]:
tup_1

In [None]:
tup_1[1:3]
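The statement that a tuple is immutable while its values can be mutable is easy to misread, so here is a minimal sketch. The tuple `record` is a made-up example.

```python
record = (1, "a", [10, 20])    # example tuple holding a mutable list

try:
    record[0] = 99             # rebinding a slot is not allowed
except TypeError:
    slot_rebound = False
else:
    slot_rebound = True

record[2].append(30)           # but the list inside can still change

first, label, values = record  # unpacking mirrors the comma-separated packing

print(record, slot_rebound, first, label, values)
```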

### 8 Conditionals

A conditional is a statement whose execution depends on a condition being true or false.
Examples: `if / elif / else`

#### Iteration
 - `while` loop
 - `for` loop
 - useful keywords: `continue, break`
 - useful objects: `range, enumerate, zip`

In [None]:
if x > y:
    print("Yeh!")
elif y < z:
    print("z = %.2f, is greater than y = %d" % (z, y))

In [None]:
for key, val in dict_1.items():
    print("%s : %s" % (key, val))

In [None]:
print(list_1, tup_1)
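The `while` loop, `continue`, `break`, `enumerate` and `zip` are listed above but not exercised in the cells. A minimal sketch, using made-up lists, could look like this:

```python
names = ["a", "b", "c"]      # example data, assumptions for this sketch
scores = [3, 8, 5]

# while loop: count down from 3
countdown = []
n = 3
while n > 0:
    countdown.append(n)
    n -= 1

# continue skips the rest of the current pass; break leaves the loop
odd_below_seven = []
for k in range(10):
    if k % 2 == 0:
        continue
    if k >= 7:
        break
    odd_below_seven.append(k)

# enumerate yields (index, item); zip pairs items from several iterables
indexed = list(enumerate(names))
paired = list(zip(names, scores))

print(countdown, odd_below_seven, indexed, paired)
```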

### 9 Generators and Comprehensions
 - `Generators` are an advanced topic, but basically they function like an on-demand `list`
 - Comprehensions are a pythonic way to go through the containers below, just like a loop
 - `list`, `set`, and `dict` accept comprehensions.

Consider the examples below, which use a list comprehension, a generator and the remainder operator `%`

In [None]:
range(10)

`range` is lazy, much like a generator: it won't produce its values unless asked. So let's ask it to send everything to a list

In [None]:
list(range(10))

In [None]:
list_comp = [i**2 for i in range(10) if i%2 ==0]

In [None]:
list_comp

In [None]:
get_comp = (i**2 for i in range(10) if i%2 ==0)

In [None]:
get_comp

In [None]:
set_comp = {i**2 for i in range(10) if i%2 ==0}

In [None]:
set_comp
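To make the "on-demand" behaviour concrete, the sketch below pulls values from a generator one at a time with `next`, and adds a `dict` comprehension, which the cells above do not show. The variable names are illustrative.

```python
gen = (i**2 for i in range(10) if i % 2 == 0)

first = next(gen)        # values are produced one at a time
second = next(gen)
rest = list(gen)         # whatever has not been consumed yet
leftover = list(gen)     # empty: a generator can be consumed only once

# dict comprehension: key and value separated by a colon
dict_comp = {i: i**2 for i in range(5) if i % 2 == 0}

print(first, second, rest, leftover, dict_comp)
```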

### 10 Functions
We will be able to write our own functions: those that return a value or not, and those that take zero or more arguments. 
Some built-in and special functions include `pow`, `range`, `enumerate` and `zip`.

In [None]:
pow(5, 2)

In [None]:
for k in enumerate(set_1):
    print(k)

In [None]:
list_2 = list_1

In [None]:
def print_set(s):
    if not isinstance(s, set):
        print("Object is not a set type")
    else:
        print(s)

In [None]:
print_set(set_1)

In [None]:
print_set(list_1)

When working with lists, and some other containers, Python passes them to functions by reference: the function receives the same object, not a copy. Element-wise work on a list is also done with an explicit Python loop, which matters when we compare the capabilities of numpy arrays and lists. As an example, consider the following

In [None]:
def sqrt(values):
    newList = [] # list to hold the square roots
    for el in values:
        newList.append(el ** 0.5)
    return newList

In [None]:
list_sq = [0, 4, 16, 36, 64]

print("Before function Call: ", list_sq)

list_sqrt = sqrt(list_sq)
print("Function Call returns: ", list_sqrt)

In [None]:
print("After function Call: ", list_sq)
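The function above builds a new list, so the caller's list is unchanged. For contrast, here is a sketch of what happens when a function modifies the list it receives. The function names `add_one_copy` and `add_one_in_place` are hypothetical and used only for this illustration.

```python
def add_one_copy(values):
    # Builds and returns a new list; the caller's list is untouched
    return [v + 1 for v in values]

def add_one_in_place(values):
    # Mutates the list object the caller passed in
    for i in range(len(values)):
        values[i] += 1

data = [1, 2, 3]
copied = add_one_copy(data)      # data is still [1, 2, 3] here
alias = data                     # a second name for the same list object
add_one_in_place(data)           # changes data, and therefore alias too

print(copied, data, alias, alias is data)
```

Because `alias` and `data` name the same object, a change made through one is visible through the other.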

In [None]:
def fsq(x):
    return x**2
def fl(f):
    return [f(k) for k in range(5)]
sq = fl(fsq)
sq

### 11 Classes
Basically, everything we have worked with so far is an object. We will expand on classes since they are relevant for understanding how the scikit-learn API works. Classes are "templates" for creating objects (this is called "instantiating" objects). An object is a collection of special "functions" (also called "methods") and attributes.

We will create a vector class that can do the first three of the following
- initialise
- return the norm
- print the vector
- multiply by a scalar

In [None]:
class vector:
    def __init__(self, x=0, y=0, dim=2):
        self.x = x
        self.y = y
        self.dimension = dim
    def norm(self):
        return pow(self.x**2 + self.y**2, 1/2)
    def __repr__(self):
        # __repr__ must return a string rather than print it
        return 'vector( {}, {} )'.format(self.x, self.y)

In [None]:
v = vector(3, 4)

In [None]:
v.norm()
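Multiplication by a scalar is listed above but is not part of the class. One possible way to add it is through the `__mul__` special method. This is a sketch only: `Vector2D` is a hypothetical stand-in class, not the notebook's `vector`.

```python
import math

class Vector2D:
    """Hypothetical 2D vector used only for this sketch."""

    def __init__(self, x=0, y=0):
        self.x = x
        self.y = y

    def norm(self):
        # Euclidean length
        return math.hypot(self.x, self.y)

    def __mul__(self, scalar):
        # Vector2D * number returns a new, scaled vector
        return Vector2D(self.x * scalar, self.y * scalar)

    def __repr__(self):
        # The returned string is what the notebook shows for the object
        return "Vector2D({}, {})".format(self.x, self.y)

u = Vector2D(3, 4)
scaled = u * 2

print(u, u.norm(), scaled)
```

Returning a new object from `__mul__` leaves `u` unchanged, which matches how numbers behave under `*`.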

Nothing is truly private in a Python class; privacy is signalled by naming convention.
- A method preceded by a single underscore is a "private" method, i.e. the method is meant to be used internally but not by the user directly (also, it does not show up in the "help" documentation)
- A method preceded by a double underscore is a "stronger" indicator that it is supposed to be private. Python mangles such names, so users can still access them but should not tamper with them
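The two conventions can be seen in a small sketch. The `Account` class and its attributes are hypothetical, invented for this illustration.

```python
class Account:
    """Hypothetical class used only to illustrate the naming conventions."""

    def __init__(self, balance):
        self._balance = balance      # single underscore: internal by convention
        self.__pin = 1234            # double underscore: name-mangled

    def balance(self):
        return self._balance

acct = Account(100)

internal_value = acct._balance              # still reachable, just discouraged
has_plain_pin = hasattr(acct, "__pin")      # the attribute is not stored under this name
mangled_pin = acct._Account__pin            # the mangled name is still accessible

print(internal_value, has_plain_pin, mangled_pin)
```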

In [None]:
class Matrix (vector):

    def __init__(self, dim):
        # super() resolves to the parent class (vector); its __init__ stores dim in self.dimension
        super().__init__(dim=dim)

### Reading files

We will be able to read various types of files as well as save them.
 - `f = open('filename.txt', 'r')` creates a file handle
 - `f.close()` closes the handle when we are done
 - we can also use `with open('filename.txt', 'r') as f:`, which closes the file automatically
 - modes include read `'r'`, write `'w'`, etc.
 
 Download the file provided. We will play around with it.

In [None]:
fh = "data/iris.csv"
f = open(fh)

# Print the first six lines of the file, then stop
i = 0
for line in f:
    print(line)
    if i==5:
        break
    i += 1

f.close()  # release the file handle when done

One way to ensure that the file is closed properly, even if an error occurs, is to use `with`

In [None]:
with open(fh) as f:
    for line in f:
        print(line[:8])
        
        if line[0]=='5':
            break
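The cells above only read. A minimal sketch of writing, appending and reading back is given below. The file name `notes.txt` and the temporary directory are assumptions made so that nothing in the `data` folder is touched.

```python
import os
import tempfile

# Hypothetical file inside a temporary directory, so no existing file is overwritten
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "notes.txt")

# 'w' creates (or truncates) the file; the with block closes it automatically
with open(path, "w") as out:
    out.write("first line\n")
    out.write("second line\n")

# 'a' appends to the end instead of truncating
with open(path, "a") as out:
    out.write("third line\n")

with open(path, "r") as src:
    lines = [line.rstrip("\n") for line in src]

file_closed = src.closed     # True once the with block has ended

print(lines, file_closed)
```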

### Import

Using `import`, a handle (a name) is created through which everything a package provides becomes available. We will see this below when we explore the packages in Anaconda

 - import numpy
 - from numpy import *somefunc*
 
 
**Examples**
`import copy`, 
`from random import random`
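The common import forms can be sketched with standard-library modules only. The alias `rnd` and the variable names are illustrative choices, not conventions from the notebook.

```python
import math                      # whole module; names are reached through math.
import random as rnd             # module under a shorter alias
from copy import deepcopy        # a single name taken from a module

root = math.sqrt(16)

rnd.seed(0)                      # fix the seed so the draw is repeatable
draw = rnd.random()              # float in [0.0, 1.0)

nested = [[1, 2], [3]]
clone = deepcopy(nested)         # independent copy, inner lists included
clone[0].append(99)              # does not affect nested

print(root, draw, nested, clone)
```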

### Resources
 - https://docs.python.org/3/tutorial/