# <font color='SEAGREEN'>Day 1</font>
# <font color='MEDIUMSEAGREEN'>Introduction to Types of Data</font>
## Python
Python is a popular language because it is open source, easy to learn, and has support for many popular libraries.
There are two major versions of Python: Python 2.X and Python 3.X. Python 2.X reached its end of life on January 1, 2020, so we will work with Python 3.

The main strength of Python for machine learning and data science is the abundance of dependable
libraries. There are packages for scientific computing (Numpy and Scipy), machine learning
(Scikit-learn), image processing (Scikit-image), computer vision (OpenCV), data visualization
(Matplotlib) and deep learning (Pytorch, Tensorflow, MxNet, etc.). Most of these libraries can
be installed with a single **pip** command, which makes setup very easy.

## Jupyter Notebook

Jupyter divides Python code into neat units called cells. There are two types of cell:
- **Markdown cells** contain text, like this cell. Double-click them to edit the text, and then press **< Shift > + < Enter >** to render it with formatting. Try double-clicking this text!
- **Code cells** contain executable Python code. Click them to edit the code, then press **< Shift > + < Enter >** to execute the block. Try it out below!


In [1]:
print("Hi, I'm a code cell! Click me and press shift + enter.")

Hi, I'm a code cell! Click me and press shift + enter.


You can add new cells using the plus sign on the menu above. Take a few minutes to look at the different options offered in the menu! There are a few that are especially useful: 

- File Menu 
    - "Download As" lets you save your notebook to your computer as a .ipynb file
- Kernel Menu
    - "Restart" restarts the kernel and clears all variables from memory; "Restart & Clear Output" also clears the output of every cell. Can be useful if your code gets stuck!
- Cell Menu
    - "Cell Type" lets you change the type of cell you are working with
    
## Python Refresher
First, let's do some exercises to refresh our memory of a few Python concepts.
### Python objects, basic types, and variables
Everything in Python is an **object** and every object in Python has a **type**. Some of the basic types include:

- **`int`** (integer; a whole number with no decimal place)
- **`float`** (float; a number that has a decimal place)
- **`str`** (string; a sequence of characters enclosed in single quotes, double quotes, or triple quotes)
- **`bool`** (boolean; a binary value that is either `True` or `False`)
- **`NoneType`** (a special type representing the absence of a value)
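
As a quick check, the built-in `type` function reports the type of any object. The sketch below uses made-up sample values, one for each basic type in the list above.

```python
# Made-up sample values, one for each basic type listed above
samples = [42, 3.14, "hello", True, None]

# type(obj).__name__ gives the name of the object's type as a string
type_names = [type(value).__name__ for value in samples]
print(type_names)  # ['int', 'float', 'str', 'bool', 'NoneType']
```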

In Python, a **variable** is a name you specify in your code that represents a specific **instance** of an object.

Defining variable names helps you remember what an object is supposed to represent or what you want to do with that object (so pick meaningful names!). Variables also allow you a lot of flexibility, letting you modify their value without having to know exactly what the new value should be. You'll see this later. 
<hr>



In [2]:
# Take a guess!
some_num1 = 1
some_num2 = 4
(some_num1 + some_num2) * some_num2

20

In [3]:
# What type of variable will the result be?
some_num1 + some_num2 == 5

True

In [4]:
# What might this do?
simple_string1 = 'an example '
simple_string2 = "of strings "
simple_string1 + simple_string2

'an example of strings '

In [5]:
# Important! Notice that the string was not modified
simple_string1

'an example '

In [6]:
# Are these two expressions equal to each other?
simple_string1 == simple_string2

False

In [7]:
# Add and re-assign
simple_string1 += 'that re-assigned the original string'
simple_string1

'an example that re-assigned the original string'

### Basic containers

> Note: **mutable** objects can be modified after creation and **immutable** objects cannot.

Containers are objects that can be used to group other objects together. Some useful container types are:

- **`str`** (string: immutable; indexed by integers; items are stored in the order they were added)
- **`list`** (list: mutable; indexed by integers; items are stored in the order they were added)
  - `[3, 5, 6, 3, 'dog', 'cat', False]`
- **`tuple`** (tuple: immutable; indexed by integers; items are stored in the order they were added)
  - `(3, 5, 6, 3, 'dog', 'cat', False)`
- **`set`** (set: mutable; not indexed at all; items are NOT stored in the order they were added; can only contain immutable objects; does NOT contain duplicate objects)
  - `{3, 5, 6, 3, 'dog', 'cat', False}`
- **`dict`** (dictionary: mutable; key-value pairs are indexed by immutable keys; since Python 3.7, items are stored in the order they were added)
  - `{'name': 'Jane', 'age': 23, 'fav_foods': ['pizza', 'fruit', 'fish']}`

When defining lists, tuples, or sets, use commas (,) to separate the individual items. When defining dicts, use a colon (:) to separate keys from values and commas (,) to separate the key-value pairs.

Strings, lists, and tuples can use all the `+`, `*`, `+=`, and `*=` operators. 

You can access items in lists and tuples by using the index of the value. You can also modify list items this way, but not tuple items, because tuples are immutable.
- `list[0]` is the first value in a list (indexing in Python starts at 0).

You can modify items in a dictionary by using the key for that item:
- `dict[key]` is the dictionary item with key `key`
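
The difference between mutable and immutable containers can be sketched as follows. The container names and contents here are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical example containers
pets_list = ["dog", "cat", "fish"]
pets_tuple = ("dog", "cat", "fish")

# Lists are mutable: assigning to an index replaces that item
pets_list[0] = "bird"

# Tuples are immutable: the same assignment raises a TypeError
try:
    pets_tuple[0] = "bird"
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Sets keep only one copy of each item
unique_numbers = {3, 5, 6, 3}

print(pets_list)            # ['bird', 'cat', 'fish']
print(tuple_changed)        # False
print(len(unique_numbers))  # 3
```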

In [8]:
# Assign some containers to different variables
list1 = [99, "Bottles ", "of pop ", "on the wall"]
tuple1 = (99, "Bottles ", "of pop")
dict1 = {'Number of bottles of pop on the wall': 98}

In [9]:
# Items in the list object are stored in the order they were added
list1

[99, 'Bottles ', 'of pop ', 'on the wall']

In [10]:
# Items in the tuple object are stored in the order they were added
tuple1

(99, 'Bottles ', 'of pop')

In [11]:
# Items in the dict object are stored in the order they were added (Python 3.7+)
dict1

{'Number of bottles of pop on the wall': 98}

In [12]:
# You can change a list item
list1[0] = 98
list1
# But you CAN'T change a tuple item

[98, 'Bottles ', 'of pop ', 'on the wall']

In [13]:
# Re-assign a dict item
dict1["Number of bottles of pop on the wall"] = 96
dict1

{'Number of bottles of pop on the wall': 96}

### Functions
A Python function is written like this:

```python
def square(x):
    return x**2
```
    
The name of the function is square, x is the input variable, and the return keyword tells us what to give as output.
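
To use the function, call it by name and pass a value for `x` inside the parentheses. A minimal sketch, with an input value chosen only for illustration:

```python
def square(x):
    return x**2

# Call the function with 4 as the input; the returned value is stored in result
result = square(4)
print(result)  # 16
```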

In [14]:
# Exercise 1. Define a function called weight_conversion that takes one variable (x) in pounds,
# uses the conversion formula kilograms = pounds / 2.2, and returns the weight in kilograms.

#### YOUR CODE STARTS HERE ####
#### YOUR CODE ENDS HERE ####

print("Testing:")
for x in [10, 22, 180, 0]:
    print(str(x), " -> ", str(weight_conversion(x)))
    print("CORRECT" if weight_conversion(x)==(x/2.2) else "INCORRECT")

Testing:


NameError: name 'weight_conversion' is not defined

### If-else statements

An if/else statement looks like this:

```python
if electoral_votes >= 270:
    print("You win the election")
else:
    print("You lose the election")
```

The condition after `if` is evaluated (`electoral_votes >= 270`). If it is true, the code under the `if` is executed; if it is false, the code under the `else` is executed.
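
For the snippet above to run, `electoral_votes` needs a value first. Here is a runnable sketch, assuming a hypothetical vote count of 300:

```python
# Hypothetical vote count chosen for illustration
electoral_votes = 300

if electoral_votes >= 270:
    outcome = "You win the election"
else:
    outcome = "You lose the election"

print(outcome)  # You win the election
```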

In [None]:
# Exercise 2. Define a function called "contains_ss" that takes one variable (word) 
# and returns True if the word contains a double-s and False if it doesn't.
# Hint: to test whether a string e.g. "ss" is inside another string variable e.g. word, write
#    if "ss" in word: 

#### YOUR CODE STARTS HERE ####


#### YOUR CODE ENDS HERE ####

print("Testing:")
for word in ["computer", "science", "lesson"]:
    print("{:s} -> {}".format(word, contains_ss(word)), end=' ')
    print("CORRECT" if contains_ss(word)==("ss" in word) else "INCORRECT")

### More complex if-else statements

Maybe you want to check *several* conditions? You can use an if/elif/else statement.

```python
if teamA_score > teamB_score:
    print("Team A wins")
elif teamA_score < teamB_score:
    print("Team B wins")
else:
    print("It's a tie!")
```

`elif` stands for "else if". In fact, the above code is just a neater way of writing this:
```python
if teamA_score > teamB_score:
    print("Team A wins")
else:
    if teamA_score < teamB_score:
        print("Team B wins")
    else:
        print("It's a tie!")
```

You can have as many `elif` statements as you like. These are useful when you want to choose between several options.
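
An if/elif/else chain also works inside a function, where each branch can return a different value. The function name `match_result` and the scores below are hypothetical, chosen only to make the team example runnable.

```python
def match_result(teamA_score, teamB_score):
    """Return a short description of who won (hypothetical helper)."""
    if teamA_score > teamB_score:
        return "Team A wins"
    elif teamA_score < teamB_score:
        return "Team B wins"
    else:
        return "It's a tie!"

print(match_result(3, 1))  # Team A wins
print(match_result(0, 2))  # Team B wins
print(match_result(1, 1))  # It's a tie!
```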

In [None]:
# Exercise 3. Define a function called "grade" that takes one input (score).
# If score >= 90, return the string "A"
# Otherwise, if score >= 80, return the string "B"
# Otherwise, if score >= 70, return the string "C"
# Otherwise, if score >= 60, return the string "D"
# Otherwise, if score >= 50, return the string "E"
# Otherwise, return the string "F"

#### YOUR CODE STARTS HERE ####


#### YOUR CODE ENDS HERE ####


print("Testing:")
for (score,g) in [(77,"C"),(80,"B"),(32,"F"),(100,"A"),(69,"D")]:
    print("{:d} -> {:s}".format(score, grade(score)), end=' ')
    print("CORRECT" if grade(score)==g else "INCORRECT")

### Loops

Loops are a useful tool that lets you reuse blocks of code without having to retype everything. There are several kinds of loops. The difference between them is what controls how many times they run: 
- **While** loops run as long as the condition at the top is True. That means they can run forever if you're not careful!
- **For** loops run for as many times as there are objects in the container at the top of the loop.

In [None]:
# An example of a while loop. What is the condition? What would happen if you didn't subtract one from index each time?
index = 5
while index > 0:
    print(index)
    index -= 1

In [None]:
# An example of a for loop. What is the collection?
my_list = [5, 4, 3, 2, 1]
for number in my_list:
    print(number)

A handy trick with loops is the range function. Say you want your loop to run 5 times, but you don't want to have to make a list with 5 numbers. You can do this:

In [None]:
# The syntax for the range function is range(start, stop, step).
# If you only include one argument, the function will start at 0 and stop just before that number, increasing by 1 each time.
for number in range(5, 0, -1):
    print(number)
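
The different ways of calling `range` can be compared by converting each result to a list. This is a small sketch; the variable names are made up for illustration.

```python
# One argument: start at 0 and stop just before 5
count_up = list(range(5))
print(count_up)      # [0, 1, 2, 3, 4]

# Two arguments: start at 2 and stop just before 6
from_two = list(range(2, 6))
print(from_two)      # [2, 3, 4, 5]

# Three arguments: start at 5, step by -1, and stop just before 0
count_down = list(range(5, 0, -1))
print(count_down)    # [5, 4, 3, 2, 1]
```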

### Python built-in functions

A **function** is a Python object that you can "call" to **perform an action** or compute and **return another object**. You call a function by placing parentheses to the right of the function name. Some functions allow you to pass **arguments** inside the parentheses (separating multiple arguments with a comma). Internal to the function, these arguments are treated like variables.

Python has several useful built-in functions to help you work with different objects and/or your environment. Here is a small sample of them:

- **`type(obj)`** to determine the type of an object
- **`len(container)`** to determine how many items are in a container
- **`sorted(container)`** to return a new list from a container, with the items sorted
- **`sum(container)`** to compute the sum of a container of numbers
- **`min(container)`** to determine the smallest item in a container
- **`max(container)`** to determine the largest item in a container
- **`abs(number)`** to determine the absolute value of a number
- **`repr(obj)`** to return a string representation of an object

> Complete list of built-in functions: https://docs.python.org/3/library/functions.html
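
One detail worth seeing in action is that `sorted` returns a new list and leaves the original container unchanged. A short sketch with a made-up list of numbers:

```python
# Made-up list of numbers for illustration
values = [7, -2, 10, 3]

# sorted() builds a new list; values itself keeps its original order
ordered = sorted(values)

print(ordered)                   # [-2, 3, 7, 10]
print(values)                    # [7, -2, 10, 3]
print(len(values), sum(values))  # 4 18
```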


In [None]:
# Try out a few of the built-in functions below! See if you can predict the output for each one. 

## Importing modules

- Modules are pre-packaged groups of Python files that you can import
- After importing a module, you can use its functions without having to write them yourself
- Here is a simple example of importing and using a module called numpy, which provides support for lots of different calculations and mathematical data structures:

In [None]:
# This line does the importing!
# The format is: "import (package name) as (name you want to use for the package in code)".
# If you leave the second part out, the package will be called by its original name.
import numpy as np

# Here is an example of a numpy ndarray, a really useful data structure to understand. It is an array with any number of dimensions,
# usually used to hold numbers. Numpy provides a lot of useful mathematical tools to work on arrays. 

ex_array = np.array([[1,2,3],[4,5,6]])    # Create a rank 2 array
print(type(ex_array))            # Prints "<class 'numpy.ndarray'>"
print(ex_array.shape)                     # Prints "(2, 3)"
print(ex_array[0, 0], ex_array[0, 1], ex_array[1, 0])   # Prints "1 2 4"
ex_array[0,0] = 3
print(ex_array)

- To learn about what different functions or objects a module contains, you have to consult its documentation. For example, here is a link to the numpy documentation: https://docs.scipy.org/doc/numpy-1.14.0/reference/
- Try picking a new function from the documentation and applying it to ex_array. What kinds of interesting things can you do?

In [None]:
# Your code:

## Data
In machine learning, data is often represented in a tabular format. Consider the example of predicting whether an individual who visits an online book seller is going to buy a specific book. This prediction can be performed by analyzing the individual’s interests and previous purchase history. For instance, when John has spent a lot of money on the site, has bought similar books, and visits the site frequently, John is likely to buy that specific book. John is an example of an instance. Instances are also called points, data points, or observations. A dataset consists of one or more instances:

<img src="images/tb1.png">

A dataset is represented using a set of features, and an instance is represented using values assigned to these features. Features are also known as *measurements* or *attributes*. In the above example, the features are Name, Money Spent, Bought Similar, and Visits; feature values for the first instance are John, High, Yes, and Frequently. Given the feature values for one instance, one tries to predict its class (or class attribute) value. In our example, the class attribute is Will Buy, and our class value prediction for the first instance is Yes. An instance such as John in which the class attribute value is unknown is called an unlabeled instance. Similarly, a labeled instance is an instance in which the class attribute value is known. Mary in this dataset represents a labeled instance. The class attribute is optional in a dataset and is only necessary for prediction or classification purposes. One can have a dataset in which no class attribute is present, such as a list of customers and their characteristics.
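
One possible way to picture these terms in Python is a list of dictionaries, one per instance. This is only a sketch: John's feature values come from the text, while Mary's feature values and the helper `is_labeled` are invented for illustration (her actual values are in the table image).

```python
# John's feature values come from the text; his class value is unknown, so it is None
john = {"Name": "John", "Money Spent": "High", "Bought Similar": "Yes",
        "Visits": "Frequently", "Will Buy": None}

# Mary's values are invented for this sketch; only her labeled status comes from the text
mary = {"Name": "Mary", "Money Spent": "Low", "Bought Similar": "No",
        "Visits": "Rarely", "Will Buy": "No"}

dataset = [john, mary]

def is_labeled(instance, class_attribute="Will Buy"):
    """Return True when the instance has a known class attribute value."""
    return instance[class_attribute] is not None

labeled = [instance["Name"] for instance in dataset if is_labeled(instance)]
unlabeled = [instance["Name"] for instance in dataset if not is_labeled(instance)]

print(labeled)    # ['Mary']
print(unlabeled)  # ['John']
```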

There are different types of features based on the characteristics of the feature and the values they can take. For instance, Money Spent can be represented using numeric values, such as $25. In that case, we have a continuous feature, whereas in our example it is a discrete feature, which can take a number of ordered values: {High, Normal, Low}.

### Different Types of Data

1. **Nominal (categorical)**. These features take values that are often represented as strings. For instance, a customer’s name is a nominal feature. In general, only a few statistics can be computed on nominal features. Examples are the chi-square statistic ($\chi^2$) and the mode (most common feature value). For example, one can find the most common first name among customers. The only possible transformation on the data is comparison. For example, we can check whether our customer’s name is John or not. Nominal feature values are often presented in a set format.

2. **Ordinal**. Ordinal features lay data on an ordinal scale. In other words, the feature values have an intrinsic order to them. In our example, Money Spent is an ordinal feature because a High value for Money Spent is more than a Low one.

3. **Interval**. In interval features, in addition to their intrinsic ordering, differences are meaningful whereas ratios are meaningless. For interval features, addition and subtraction are allowed, whereas multiplication and division are not. Consider two time readings: 6:16 PM and 3:08 PM. The difference between these two time readings is meaningful (3 hours and 8 minutes); however, there is no meaning to the ratio $\frac{\text{6:16 PM}}{\text{3:08 PM}}$, which is not equal to 2.

4. **Ratio**. Ratio features, as the name suggests, add the additional properties of multiplication and division. An individual’s income is an example of a ratio feature where not only differences and additions are meaningful but ratios also have meaning (e.g., an individual’s income can be twice as much as John’s income).
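
The operations that make sense for each feature type can be sketched with the standard library alone. The names, ranks, and incomes below are made-up values; the two time readings are the ones from the interval example.

```python
from datetime import datetime
from statistics import mode

# Nominal: only comparison and counting (for example, the mode) make sense
first_names = ["John", "Mary", "John", "Ana"]   # made-up names
most_common_name = mode(first_names)

# Ordinal: values have an order, so they can be ranked and compared
spending_rank = {"Low": 0, "Normal": 1, "High": 2}
high_exceeds_low = spending_rank["High"] > spending_rank["Low"]

# Interval: the difference between two readings is meaningful
time_format = "%I:%M %p"
gap = datetime.strptime("6:16 PM", time_format) - datetime.strptime("3:08 PM", time_format)

# Ratio: ratios are meaningful as well (hypothetical incomes)
john_income = 40_000
other_income = 80_000
income_ratio = other_income / john_income

print(most_common_name)   # John
print(high_exceeds_low)   # True
print(gap)                # 3:08:00
print(income_ratio)       # 2.0
```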

### Example
Study the table below and list the type of each feature. Explain your reasoning.

<img src="images/tb2.png">

In [None]:
# Your answer:
# 1. Outlook:
# 2. Temperature:
# 3. Humidity:
# 4. Windy:
# 5. Play:

## Dataset
To download the data go to [HERE](link) and download the following file:

    - heart.csv

This file contains the heart disease data in "csv" format. "csv" stands for "comma-separated values". We'll use this information later, when we tell the program how to load this data.
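
To get a feel for what "comma-separated" means, here is a small sketch that splits a made-up piece of csv text by hand. The column names and values are placeholders and are not taken from heart.csv.

```python
# Made-up csv text: each line is one row, and commas separate the values
csv_text = "city,temperature\nOslo,4\nCairo,29"

lines = csv_text.split("\n")
header = lines[0].split(",")                     # values on the first line
rows = [line.split(",") for line in lines[1:]]   # values on the remaining lines

print(header)  # ['city', 'temperature']
print(rows)    # [['Oslo', '4'], ['Cairo', '29']]
```

Note that every value comes out as a string when the text is split this way.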

PRO TIP: Make sure that the downloaded dataset and this Jupyter notebook are in the same directory (folder); otherwise the notebook will not be able to find the file later.

Open the file (for example, with Microsoft Excel) and look at what the data contains.
Try to answer the following questions:

    - What features/attributes are provided?
    - What does each line represent?


In [None]:
# Write your answers in comments below (1 line, each):
# Answer to Q1.
# Answer to Q2.

The [original data](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) contains 76 attributes, but all published experiments refer to using a subset of 14 of them.

Go to the link above (click on "original data") and read the descriptions of the features.

Find out the type of each feature (this is important for future coding and for getting accurate results).

In [1]:
# Your answer (e.g., feature1: nominal, because we can only compare the values)

### Load the Data
To load the data, we first import the library we need: pandas.

In [None]:
import pandas as pd

Run the following code to load the data you downloaded into a DataFrame.

In [None]:
data = pd.read_csv("heart.csv")

In [None]:
data.head()

What do you think the head() function does?

In [None]:
# Write down your answer in comments (one line):
#

In [None]:
data.iloc[0]

In [None]:
data.iloc[0, 0]

What does .iloc return?

In [None]:
# Write down your answer in comments (one line):
#

### Layout of the Data
The ``.columns`` attribute of a DataFrame tells us the names of the columns. Run the following cell to examine the column names of the DataFrame we just created.

In [None]:
data.columns

Now that we've looked at the columns of the dataset, let's look at the rows. How many rows are in the dataset? We can use the ``.shape`` attribute to find the number of rows in the dataset. Can you guess what the second number returned by ``.shape`` corresponds to?

In [None]:
data.shape

In [None]:
# Your answer:

**Exercises:**
- Divide the content of even rows by 2. Print the first 5 rows.

In [None]:
# Your code

- Multiply the content of odd columns by 3. Print the first 5 rows.

In [None]:
# Your Code

### Research Time:
- Research what other methods besides DataFrames can be used to load ``.csv`` files. If there are any, find out how to access the rows and columns with that method.

In [None]:
# Your code:

- Missing data (or missing values) is defined as the data value that is not stored for a variable in the observation of interest. Research what types of missing data exist and how missing data can be handled.

In [None]:
# Your answer (3-4 lines):

- Noisy data is data with a large amount of additional meaningless information in it called noise. Find the difference between noise and outliers. Mention a method that can be used to eliminate noise.

In [None]:
# Your answer (3-4 lines):

**Coding Activity:**

[The Cat and Mouse Game](http://tangra.cs.yale.edu/naclo/practice/2012A.html)

Write a program that will be able to understand coded messages based on the above pattern.

In [None]:
# Your code:

In [None]:
print("Nice work today!")

References:
    - Princeton AI4ALL NLP Project
    - Zafarani, Reza, Mohammad Ali Abbasi, and Huan Liu. Social media mining: an introduction. Cambridge University Press, 2014.