# Course Overview and Background

The goal of this course is to help students acquire a basic level of proficiency in working with and analyzing
data using contemporary tools and techniques. Students will gain a hands-on understanding of classical data analysis
techniques and will develop proficiency in applying these techniques in a modern programming language
(Python). Lectures will go over skills that help (a) understand the practical settings in which various methods are
useful and (b) interpret the results (and assess the significance) of analyses.

Students who successfully complete this course will be proficient in basic data acquisition, manipulation,
and analysis techniques. They will be able to understand and carry out commonly used methods for clustering,
classification, and regression analysis. They will be able to interpret their results, and discuss the limitations of their methodology. They will also understand and be able to articulate some efficiency and systems issues related to
working with very large datasets.

The two paradigms/frameworks that underlie many data science tools and techniques we will see in the course are (1) computational approaches to data representation and processing and (2) linear algebra (as a foundation for statistics, probability, and machine learning approaches). We will be using a programming language and associated frameworks that allow us to combine these two in a streamlined way.

Data sets can often be viewed as points in a multidimensional vector space (though sometimes one or more dimensions in that space may be discrete rather than continuous). One fundamental component of data analytics workflows involves starting with a data set in this form that is rich and complex (in terms of the number of attributes/dimensions) and converting it into something simpler that allows mathematical and computational techniques to be more effective.
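
As a small illustration of the vector-space view, the sketch below represents two records as tuples of numbers and computes the Euclidean distance between them. The attribute names and values are made up for illustration.

```python
import math

# Two hypothetical records, each described by (height_cm, weight_kg, age_years).
p = (170.0, 65.0, 30.0)
q = (173.0, 69.0, 30.0)

# Euclidean distance: square root of the sum of squared coordinate differences.
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
print(distance)  # 5.0
```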

The techniques we will introduce in the course (built on top of linear algebra) that serve this purpose can be broken down into the following categories:

* Distance and similarity
* Dimensionality reduction
* Clustering
* Classification
* Regression

We will also introduce mathematically rigorous methods for assessing, characterizing, and interpreting the significance and validity of analysis results.

# Introduction/Review of Python 

In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'
# import libraries
import numpy as np
import matplotlib as mp
import pandas as pd
import matplotlib.pyplot as plt
from importlib import reload
from datetime import datetime
from IPython.display import Image
from IPython.display import display_html
from IPython.display import display
from IPython.display import Math
from IPython.display import Latex
from IPython.display import HTML
print('')




A number of different languages and associated packages are commonly used by contemporary data scientists to perform tasks and develop workflows. Examples include R and Python. In this course, we will be using Python 3 (together with a few common environment tools and libraries).

Because not everyone will have the same level of familiarity with Python, we will review several common data structures (and associated functions) that are natively supported. 

Python is an interpreted language that supports both imperative and functional programming styles. This makes it well-suited both for rapid and interactive prototyping (e.g., of sequential data retrieval and processing workflows) and for working with mathematical objects (e.g., vectors, vector spaces, functions, and so on). The Python community has developed a rich and extensive collection of tools and libraries.


There are four ways to run Python code:

* Put your code in a file (say `example.py`) and run `python example.py`
* Type your code into the Python interpreter
 * this allows you to interact with the interpreter and fix mistakes as they happen
 * however, commands must be retyped
* Type, cut/paste, or run your code in `ipython`
* Run `ipython` in a browser using `jupyter notebook`
 * like `ipython` but with interleaved documentation and graphical output
 * these slides are being presented in this way
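
For the first option, a script commonly wraps its work in a function and calls that function only when the file is run directly. The sketch below assumes a file named `example.py` (the name used above); the function name `main` and the message are illustrative.

```python
# Contents of a hypothetical example.py

def main():
    # Print a message and also return it so that other code can reuse it.
    message = "Hello from example.py"
    print(message)
    return message

# This condition is true when the file is run as `python example.py`
# and false when the file is imported as a module.
if __name__ == "__main__":
    main()
```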

## Basic Data Structures Native to Python

Python has a number of built-in data types, including:

* Strings
* Integers
* Floats
* Booleans

In [8]:
a = 7
type(a)

int

In [9]:
b = 3
type(b)

int

In [10]:
c = 3.2
type(c)

float

In [11]:
d = True
type(d)

bool

Note that, unlike C, Java, and other statically typed languages, Python **does not require explicitly declared variable types**. Python is dynamically typed: a name can be bound to values of different types over time, and types are checked when the code runs.

In [12]:
myVar = 'I am a string.'
print(myVar)
myVar = 2.3
print(myVar)

I am a string.
2.3


### Strings

When dealing with unstructured or partially structured data, string manipulation may be important. A string literal is delimited by either single quotes or double quotes. Pick one option and be consistent.

In [None]:
'This is a string.'

In [None]:
"This is also a string."

The `+` operator concatenates strings.

In [None]:
a = "Hello"  
b = " World" 
a + b

Individual characters can be extracted from a string by indexing, and substrings by slicing (the `start:stop` notation, in which the character at position `stop` is excluded).

In [13]:
a = "World"

a[0]

'W'

In [14]:
a[-1]

'd'

In [15]:
"World"[0:4]

'Worl'

In [16]:
a[::-1]

'dlroW'
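
The general slice form is `s[start:stop:step]`: `start` is included, `stop` is excluded, and any omitted part takes a default that covers the whole string. A brief sketch:

```python
s = "Hello World"

first_word = s[:5]      # characters 0 through 4 -> 'Hello'
last_word = s[6:]       # from index 6 to the end -> 'World'
every_other = s[::2]    # every second character -> 'HloWrd'
last_three = s[-3:]     # negative indices count from the end -> 'rld'
```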

There are a number of useful methods that allow us to work with strings in a concise way.

In [17]:
a = "Hello World"
"-".join(a)

'H-e-l-l-o- -W-o-r-l-d'

In [18]:
a.startswith("Wo")

False

In [19]:
a.endswith("rld")

True

In [20]:
a.replace("o","0").replace("d","[)").replace("l","1")

'He110 W0r1[)'

In [21]:
a.split()

['Hello', 'World']

In [22]:
a.split('o')

['Hell', ' W', 'rld']
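
These methods combine naturally when cleaning partially structured text. The sketch below normalizes a made-up comma-separated line; it also uses the built-in string methods `strip()` and `lower()`, which are not shown above.

```python
line = "  Asparagus, MEAT ,oranges  "

cleaned = []
for part in line.split(","):
    # strip() removes surrounding whitespace; lower() converts to lowercase.
    cleaned.append(part.strip().lower())

result = ", ".join(cleaned)
print(result)  # asparagus, meat, oranges
```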

Strings are an example of an **immutable** data type.  Once you instantiate a string, you cannot change it. 

In [23]:
string = "string"
string[-1] = "y"  # This generates an error as we attempt to modify the string.

TypeError: 'str' object does not support item assignment
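
Since a string cannot be modified in place, the usual approach is to build a new string from pieces of the old one. A minimal sketch:

```python
string = "string"

# Slicing and concatenation create a new string; the original is untouched.
modified = string[:-1] + "y"

print(string)    # string
print(modified)  # striny
```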

To create a string with embedded objects, we can use the `.format()` method.

In [None]:
course_name = 'CS505'
enrollment = 75
percent_full = 100.0
'The course {} has an enrollment of {} and is {} percent full.'.format(
    course_name,enrollment,percent_full)
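
In Python 3.6 and later, formatted string literals (f-strings) offer an equivalent and often more readable alternative. The sketch below reuses the values from the example above and adds a format specifier (`.1f`) to fix the number of decimal places.

```python
course_name = 'CS505'
enrollment = 75
percent_full = 100.0

# Expressions inside {} are evaluated and inserted into the string.
message = (f'The course {course_name} has an enrollment of {enrollment} '
           f'and is {percent_full:.1f} percent full.')
print(message)
```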

## Higher-order Data Structures: Tuples, Lists, Sets, and Dictionaries

The base data types allow us to represent individual values. A number of built-in data structures allow us to work with multiple values and relationships between values.

### Tuples

Tuples make it possible to put multiple values (even of different types) together in a fixed order. Tuples are **immutable** (like strings): they cannot be modified once they are created. Below is a tuple of several values.

In [None]:
groceries = ('orange', 'meat', 'asparagus', 2.5, True)
groceries

In [None]:
groceries[2]

What will happen here?

In [None]:
groceries[2] = 'milk'
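
Two behaviors of tuples are worth seeing in code: a tuple can be unpacked into separate names, and assigning to one of its elements raises a `TypeError`. A small sketch, using a shortened version of the tuple above:

```python
groceries = ('orange', 'meat', 'asparagus')

# Unpacking assigns each element to its own name.
first, second, third = groceries

# Tuples do not support item assignment, so this raises TypeError.
try:
    groceries[2] = 'milk'
    assignment_failed = False
except TypeError as error:
    assignment_failed = True
    print(error)
```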

### Lists

A list is an *ordered* collection of values which can be of differing types and in which duplicates are allowed. Lists differ from tuples primarily in that they can be modified after they are created. Below, we create an empty list.

In [25]:
groceries = []

Below, we add some items to the list.

In [26]:
groceries.append("oranges")  
groceries.append("meat")
groceries.append("asparagus")
groceries

['oranges', 'meat', 'asparagus']

We can access individual elements in a list by using their index (which begins at `0`).

In [27]:
groceries[0]

'oranges'

In [28]:
groceries[2]

'asparagus'

In [29]:
len(groceries)

3

We can also modify a list by sorting it.

In [30]:
groceries.sort()
groceries

['asparagus', 'meat', 'oranges']

We can remove an element from a list.

In [None]:
groceries.remove('asparagus')
groceries

Because lists are mutable, you can arbitrarily modify them.

In [None]:
groceries[0] = 'peanut butter'
groceries
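
One consequence of mutability is that two names can refer to the same list, so a change made through one name is visible through the other. Making a copy avoids this. A brief sketch (the names `alias` and `snapshot` are illustrative):

```python
groceries = ['asparagus', 'meat', 'oranges']

alias = groceries              # both names refer to the same list object
snapshot = groceries.copy()    # a new list with the same elements

groceries.append('milk')

print(alias)     # ['asparagus', 'meat', 'oranges', 'milk']
print(snapshot)  # ['asparagus', 'meat', 'oranges']
```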

### Sets

A set is an *unordered* collection of items that *cannot contain duplicates*. These correspond to mathematical sets in most relevant ways.

In [None]:
numbers = range(10)
numbers = set(numbers)

evens = {0, 2, 4, 6, 8}

odds = numbers - evens
odds

Sets also support common binary operators such as union (`|`) and intersection (`&`).
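
A short sketch of these operators, together with set difference (`-`); the set `small` is made up for illustration, and `sorted` is used only to print the elements in a predictable order:

```python
evens = {0, 2, 4, 6, 8}
small = {0, 1, 2, 3}

union = evens | small          # elements in either set
intersection = evens & small   # elements in both sets
difference = evens - small     # elements in evens but not in small

print(sorted(union))         # [0, 1, 2, 3, 4, 6, 8]
print(sorted(intersection))  # [0, 2]
print(sorted(difference))    # [4, 6, 8]
```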

### Dictionaries 

A dictionary is a finite map from keys to values, with exactly one value corresponding to each key. Values can be any Python objects. Note that *keys must be unique* (and must be hashable, as strings, numbers, and tuples of these are).

In [None]:
simple_dict = {}

simple_dict['cs506'] = 'data-mining tools'

simple_dict['cs506']

Below we create a dictionary. Note the use of curly braces and the colon.

In [3]:
classes = {
    'cs506': 'data-mining tools',
    'cs131': 'algorithms'
}

We can use the `in` operator to check if a key is mapped by the dictionary to a value.

In [4]:
'cs530' in classes

False

We can add a new key-value pair to the dictionary.

In [5]:
classes['cs530'] = 'graduate algorithms'
classes['cs530']

'graduate algorithms'
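
Indexing a dictionary with a key that is not present raises a `KeyError`. The `get` method returns a default value instead. A short sketch (the key `'cs999'` and the default string are illustrative):

```python
classes = {
    'cs506': 'data-mining tools',
    'cs131': 'algorithms'
}

# get() returns its second argument when the key is missing.
known = classes.get('cs506', 'unknown course')
missing = classes.get('cs999', 'unknown course')

# Indexing a missing key raises KeyError.
try:
    classes['cs999']
    raised = False
except KeyError:
    raised = True
```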

We can get the collection of keys from a dictionary.

In [6]:
classes.keys()

dict_keys(['cs506', 'cs131', 'cs530'])

We can also get the collection of values from a dictionary.

In [7]:
classes.values()

dict_values(['data-mining tools', 'algorithms', 'graduate algorithms'])

We can get the key-value entries in the dictionary as tuples.

In [9]:
classes.items()

dict_items([('cs506', 'data-mining tools'), ('cs131', 'algorithms'), ('cs530', 'graduate algorithms')])

The code below prints the key-value pairs in a dictionary.

In [10]:
for key, value in classes.items():
    print(key, value)

cs506 data-mining tools
cs131 algorithms
cs530 graduate algorithms


Dictionaries can be combined to make complex data structures. Below is a list within a dictionary within a dictionary.

In [11]:
professors = {
    "prof1": {
        "name": "Evimaria Terzi",
        "interests": ["algorithms", "data mining", "machine learning"]
    },
    "prof2": {
        "name": "Mark Crovella",
        "interests": ["computer networks", "data mining", "biological networks"]
    },
    "prof3": {
        "name": "George Kollios",
        "interests": ["databases", "data mining"]
    },
    "prof4": {
        "name": "Adam Smith",
        "interests": ["cryptography", "data privacy", "machine learning"]
    }
}

In [12]:
for prof in professors:
    print('{} is interested in {}.'.format(
            professors[prof]["name"],
            professors[prof]["interests"][0]))

Evimaria Terzi is interested in algorithms.
Mark Crovella is interested in computer networks.
George Kollios is interested in databases.
Adam Smith is interested in cryptography.
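
Nested structures like this can be queried with ordinary loops. The sketch below collects the names of professors whose interests include a given topic; it repeats two of the entries above so that it is self-contained.

```python
professors = {
    "prof1": {
        "name": "Evimaria Terzi",
        "interests": ["algorithms", "data mining", "machine learning"]
    },
    "prof2": {
        "name": "Mark Crovella",
        "interests": ["computer networks", "data mining", "biological networks"]
    }
}

topic = "machine learning"
matches = []
for entry in professors.values():
    # `in` checks membership in the inner list of interests.
    if topic in entry["interests"]:
        matches.append(entry["name"])

print(matches)  # ['Evimaria Terzi']
```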


### Comprehension Notation

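Python supports **comprehension** notation, allowing us to concisely build new lists, sets, and dictionaries. (There is no tuple comprehension: a comprehension written in parentheses produces a generator, which can be passed to `tuple()`.)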

In [14]:
groceries = ['asparagus', 'meat', 'oranges']
vegetables = [x for x in groceries if x != "meat"]
vegetables

['asparagus', 'oranges']

The above corresponds to the following imperative block of code.

In [15]:
newlist = []
for x in groceries:
    if x != 'meat':
        newlist.append(x)
newlist

['asparagus', 'oranges']

Recall the mathematical notation:

$$L_1 = \left\{x^2 : x \in \{0\ldots 9\}\right\}$$

$$L_2 = \left\{1, 2, 4, 8,\ldots, 2^{12}\right\}$$

$$M = \left\{x \mid x \in L_1 \text{ and } x \text{ is even}\right\}$$

In [16]:
L1 = [x**2 for x in range(10)]
L2 = [2**i for i in range(13)]
print('L1 is {}'.format(L1))
print('L2 is {}'.format(L2))

L1 is [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
L2 is [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]


In [17]:
M = [x for x in L1 if x % 2 == 0]
print('M is {}'.format(M))

M is [0, 4, 16, 36, 64]
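
The same notation works for sets and dictionaries by using curly braces. A minimal sketch (the list `words` is made up for illustration):

```python
words = ['asparagus', 'meat', 'oranges', 'meat']

# Set comprehension: duplicates collapse, so each length appears once.
lengths = {len(w) for w in words}

# Dictionary comprehension: map each word to its length.
length_of = {w: len(w) for w in words}

print(sorted(lengths))  # [4, 7, 9]
print(length_of)        # {'asparagus': 9, 'meat': 4, 'oranges': 7}
```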


In the example below, we generate the composite numbers below 50 and remove them from the set of all numbers in that range. What is left are the prime numbers. Every composite number below 50 has a factor of at most 7, so it suffices to let `i` range over 2 through 7 and `j` over 2 through 24.

In [18]:
composites = [i*j for i in range(2,8) for j in range(2,25)]

In [19]:
primes = [x for x in range(2,50) if x not in composites]
print(primes)

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
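
An alternative that does not depend on a precomputed list of composites is to test each candidate directly for divisors. This is a sketch of trial division, not the method used above:

```python
# A number n >= 2 is prime if no d with 2 <= d and d*d <= n divides it.
primes = [n for n in range(2, 50)
          if all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))]
print(primes)
```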
