# Exploratory Data Analysis (EDA) - Day 3

This notebook covers the fundamental data types in Python and then takes a closer look at the components of a dataset.

#### **Topics:**


1.   Data Structures & Basic Functions 
2.   Intermediate Functions
3.   Read-in Data
4.   Explore Dataset Features & Target
5.   Conclusion


#### **Goals:**


1.   Understand Data Structures in Python
2.   Define Functions in Python
3.   Read-in a Dataset with Python
4.   Understand The Components of a Dataset
5.   Examine Datasets with Python



## Import Open-Source Packages

You will do this at the beginning of every script: import the open-source (or local) packages your code depends on.



1.   **Pandas:** Working with datasets. Arguably the most widely-used data-science Python package.
2.   **NumPy:** Scientific computing package for working with vectors & matrices. 
3. **Matplotlib:** Tool for dataset visualizations.
4. **Seaborn:** Statistical visualization library built on top of Matplotlib.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Basic Functions & Data Structures

### Data Structures

In this section we will identify and work with basic data structures.

**Common Data Structures:**


1.   **Integer:** 1
2.   **Float:** 1.0
3.   **String:** 'hello'
4.   **List:** [1,2,2,4,5]
5.   **Set:** {1,2,3,4,5}
6.   **Tuple:** (1,12)
7.   **Dictionary:** {'key':value}
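As a minimal sketch of the structures listed above, the cell below declares one value of each kind and prints its type. The variable names are hypothetical and chosen only for illustration.

```python
# Illustrative values only; the variable names are hypothetical.
an_integer = 1
a_float = 1.0
a_string = 'hello'
a_list = [1, 2, 2, 4, 5]     # ordered, mutable, duplicates allowed
a_set = {1, 2, 2, 4, 5}      # curly braces; unordered, duplicates are dropped
a_tuple = (1, 12)            # ordered, immutable
a_dict = {'key': 'value'}    # maps keys to values

# Print the type name next to each value
for item in (an_integer, a_float, a_string, a_list, a_set, a_tuple, a_dict):
    print(type(item).__name__, item)

# Indexing starts at 0; dictionaries are accessed by key
first_in_list = a_list[0]
second_in_tuple = a_tuple[1]
value_for_key = a_dict['key']
print(first_in_list, second_in_tuple, value_for_key)
```

Note that the set keeps only four elements because the repeated `2` is stored once.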




In [1]:
# Declaring variables

## TODO: YOUR CODE HERE ##

In [2]:
# Accessing Elements
## We use indexing, & Python indexing starts at 0

## TODO: YOUR CODE HERE ##

In [3]:
# Accessing Dictionary Elements
## We use the key to access the value

## TODO: YOUR CODE HERE ##

In [4]:
# Creating Arrays (vectors)
## TODO: YOUR CODE HERE ##

# Creating Matrices
## TODO: YOUR CODE HERE ##

# Print array, type, shape
## TODO: YOUR CODE HERE ##

### Object Methods

Notice that when we printed the type of each variable above, each one was a class. In object-oriented programming (OOP), **classes have functions called methods.**
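To make the idea of methods concrete, here is a small sketch that calls a few built-in methods on a list, a set, and a dictionary. The variable names and values are hypothetical.

```python
# List methods
numbers = [3, 1, 2]
numbers.append(4)    # add one element to the end
numbers.sort()       # sort the list in place

# Set methods
unique = {1, 2}
unique.add(2)        # already present, so the set does not change
unique.add(3)

# Dict methods
ages = {'ana': 31}
ages.update({'ben': 27})                 # add or overwrite key/value pairs
missing = ages.get('cy', 'missing')      # .get returns a default instead of raising KeyError

print(type(numbers), numbers)
print(type(unique), sorted(unique))
print(type(ages), list(ages.keys()), missing)
```

Methods are called with dot notation on the object itself, which is why `numbers.sort()` changes `numbers` without any assignment.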

In [None]:
# List methods
## TODO: YOUR CODE HERE ##

# Set methods
## TODO: YOUR CODE HERE ##

# Dict methods
## TODO: YOUR CODE HERE ##

### Functions

In this section we will define and call basic functions.

**Components of Functions:**


1.   **def:** Keyword used to indicate we are defining a function
2.   **Args/Parameters:** The inputs to a function
3.   **return:** Keyword used to indicate the return value of a function
4.   **Docstring:** String inside a function indicating what the function does, returns, and the param types
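The four components above can be seen together in a short sketch. The function `rectangle_area` is a hypothetical example and is not one of the exercises in this notebook.

```python
def rectangle_area(width, height):
    """
    Return the area of a rectangle.

    Args:
        width: int or float
        height: int or float

    Returns:
        int or float
    """
    return width * height


# Calling the function and storing its return value
area = rectangle_area(3, 4)
print(area)

# The docstring is stored on the function and is what help() displays
print(rectangle_area.__doc__)
```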


In [6]:
# Defining Functions - 1

def add(x, y):
  """
  This Function Returns The Sum of The Inputs

  Args:
    -- 'x': integer
    -- 'y': integer

  Returns:
    -- integer
  """
  ## TODO: YOUR CODE HERE ##
    

    

# Calling Functions
print(add(2,6))
print(add(22,12))
print(add(56,611))

In [7]:
# Defining Functions - 2
## Functions don't always require you to pass in args
## We can also assign the function's return value to a variable

def create_random_array():
  """
  This Function Creates a Random Array (matrix)

  Args:
    -- None

  Returns:
    -- np.array 
  """
  ## TODO: YOUR CODE HERE ##

def create_specific_array(x,y):
  """
  This Function Creates a Random Array (matrix) w/ Shape (x,y)

  Args:
    -- 'x': 
      - num of samples in array
      - type int
    -- 'y':
      - num of dimensions in array
      - type int

  Returns:
    -- np.array with shape (x,y)
  """

  ## TODO: YOUR CODE HERE ##

    
# Calling functions
new_array1 = create_random_array()
new_array2 = create_specific_array(3,3)

# Examine return values
## TODO: YOUR CODE HERE ##

In [8]:
# Defining Functions - 3
## Functions can return multiple values!

def add_subtract(x,y,z):
  """
  This Function Performs Simple Addition & Subtraction on Inputs

  Args:
    -- 'x': type int
    -- 'y': type int
    -- 'z': type int

  Returns:
    -- 'a': 
      - (x+z)
      - type int
    -- 'b':
      - (y-z)
      - type int
    -- 'c':
      - (x+y+z)
      - type int
  """

  ## TODO: YOUR CODE HERE ##



# Calling functions
a,b,c = add_subtract(2,5,10)
print('a:', a)
print('b:', b)
print('c:', c)

In [9]:
# Defining Functions - 4
## Functions can call other functions!
## 'create_random_array()' is actually calling a function as well: np.random.randint()

def add_two(x):
    """
    This function adds two to x
    
    Args:
        - 'x': int or float
        
    Returns:
        - 'x+2': int or float
    """
    ## TODO: YOUR CODE HERE ##

def subtract_ten(x):
    """
    This function subtracts ten from x
    
    Args:
        - 'x': int or float
        
    Returns:
        - 'x-10': int or float
    """
    ## TODO: YOUR CODE HERE ##

def function_call(x):
    """
    This function calls 'add_two' and 'subtract_ten'
    
    Args:
        - 'x': int or float
        
    Returns:
        - 'x+2-10': int or float
    """
    ## TODO: YOUR CODE HERE ##
    

# Calling functions
print(function_call(12))
print(function_call(123))
print(function_call(1))

## Intermediate Functions

In this section we will define more complex functions, such as ones that use loops and access elements of objects.
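Before the exercises, the sketch below illustrates the ideas this section relies on: looping over a list, changing a list in place versus returning a new one, and looping over a dictionary with a condition. All function and variable names here are hypothetical, and doubling is used instead of the operations in the exercises.

```python
def double_in_place(values):
    """Double every element of 'values', modifying the original list."""
    for i in range(len(values)):
        values[i] = values[i] * 2


def double_out_of_place(values):
    """Return a new list with every element doubled; 'values' is unchanged."""
    doubled = []
    for v in values:
        doubled.append(v * 2)
    return doubled


original = [1, 2, 3]
new_list = double_out_of_place(original)   # original is still [1, 2, 3] here
snapshot = list(original)                  # copy taken before the in-place change
double_in_place(original)                  # original itself is now changed

# Looping over key, value pairs with a conditional
scores = {'ana': 91, 'ben': 68, 'cy': 75}
passed = []
for name, score in scores.items():
    if score >= 70:
        passed.append(name)

# The same filter written as a list comprehension
passed_comp = [name for name, score in scores.items() if score >= 70]

print(snapshot, new_list, original)
print(passed, passed_comp)
```

The in-place version returns nothing and changes its argument, while the out-of-place version leaves its argument alone and hands back a new list.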

In [10]:
# Intermediate Functions - 1
## Access elements from a list

def loop_over(l):
    """
    Loop over elements in a list 'l'

    Args:
      - 'l': list
    """
    ## TODO: YOUR CODE HERE ##

def add_two_inplace(l):
    """
    Add two to every element of list 'l' in place (modifies 'l')
    
    Args:
        - 'l': list
    """
    ## TODO: YOUR CODE HERE ##

def add_two_outplace(l):
    """
    Add two to every element of list 'l' out of place (returns a new list, 'l' is unchanged)
    
    Args:
        - 'l': list
    """
    ## TODO: YOUR CODE HERE ##

    
# Calling functions
## TODO: YOUR CODE HERE ##

In [11]:
# Intermediate Functions - 2
## Accessing elements from a dictionary
## Conditional Statements (if/else)

def show(d):
    """
    Loop over the key, value pairs in a dict 'd'

    Args:
      - 'd': dict
    """
    ## TODO: YOUR CODE HERE ##

def show_values(d):
    """
    Loop over the keys to print just the values in a dict 'd'

    Args:
      - 'd': dict
    """
    ## TODO: YOUR CODE HERE ##

def key_from_val(d, value):
    """
    Loop over the key, value pairs in a dict 'd' and print each key whose value equals 'value'

    Args:
      - 'd': dict
      - 'value': the value to search for
    """
    ## TODO: YOUR CODE HERE ##


# Calling functions
d = {'key1':[1,2,3,4,5], 'key2':12, 'key3':'this is a string', 
     'key4':[55,7,85]}

key_from_val(d, 12)

In [12]:
# Intermediate Functions - 3
## Putting it all together

def show_even_numbers_loop(l): # version one (with a loop)
  """
  This function takes in a list and 
  returns a new list containing all even numbers
  using a loop

  Args:
    -- 'l': list containing integers

  Returns:
    -- new list containing all even numbers
  """

  ## TODO: YOUR CODE HERE ## 

def show_even_numbers_comp(l): # version two (with list comprehension)
  """
  This function takes in a list and 
  returns a new list containing all even numbers
  using list comp

  Args:
    -- 'l': list containing integers

  Returns:
    -- new list containing all even numbers
  """

  ## TODO: YOUR CODE HERE ##

    
# Calling functions
l = [1,2,3,4,5,6,7,8,9,10]

loop_result = show_even_numbers_loop(l)
comp_result = show_even_numbers_comp(l)

print(loop_result)
print(comp_result)

## Working with Datasets

In this section we will learn how to read in and work with different types of data.

#### **Introduction**

**Dataframe (CSV)**
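As a rough sketch of reading a CSV into a DataFrame, the cell below parses a small hypothetical table held in a string. The column names and values are made up for illustration; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for a file on disk
csv_text = """sepal_length,petal_length,species
5.1,1.4,setosa
7.0,4.7,versicolor
6.3,6.0,virginica
"""

# pd.read_csv accepts a file path or any file-like object
flowers = pd.read_csv(io.StringIO(csv_text))

print(flowers.shape)   # (rows, columns)
print(flowers.head())  # first rows of the table
```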

In [13]:
# Read in csv, show head
## TODO: YOUR CODE HERE ##

**Text Data**
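One possible way to read a text file into a single string uses only the standard library. The file name `notes.txt` and its contents are hypothetical; the file is written to a temporary directory first so the example runs on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'notes.txt')

    # Create a small hypothetical file to read back
    with open(path, 'w', encoding='utf-8') as f:
        f.write('first line\nsecond line\n')

    # Read the whole file into one string
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()

print(type(text).__name__)
print(text.splitlines())
```

Using `with` closes the file automatically, even if an error occurs while reading.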

In [None]:
# Read in txt file as str
## TODO: YOUR CODE HERE ##

**Image Data**

In [14]:
# Read in image using OpenCV, show image
## TODO: YOUR CODE HERE ##

#### **EDA (Exploratory Data Analysis)**

In this course, we will mainly be working with Dataframes (CSV files). 
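The sketch below previews the DataFrame operations used in the rest of this section on a tiny hypothetical table: selecting columns, selecting by label with `loc`, selecting by position with `iloc`, and filtering rows with a condition. The `people` DataFrame and its columns are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data, not one of the course datasets
people = pd.DataFrame({
    'age': [25, 40, 31, 58],
    'height_cm': [170.0, 182.5, 165.2, 177.8],
    'smoker': ['no', 'yes', 'no', 'no'],
})

one_column = people['age']                      # a single column is a Series
two_columns = people[['age', 'smoker']]         # a list of names gives a DataFrame
by_label = people.loc[0:1, ['age', 'smoker']]   # label-based; the slice end is included
by_position = people.iloc[0:2, 0:2]             # position-based; the slice end is excluded
over_30 = people.loc[people['age'] > 30]        # boolean mask keeps the matching rows

print(people.describe().T)   # summary statistics for the numerical columns
print(by_label)
print(by_position)
print(over_30)
```

Both `loc[0:1]` and `iloc[0:2]` return the first two rows here, which shows the difference between inclusive label slices and exclusive position slices.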

In [None]:
## Explore different datasets

# Read in email/iris/heart csv, show head
## TODO: YOUR CODE HERE ##

In [15]:
## This dataset contains information on flower species

print('SHAPE: ', iris.shape)
iris.head()

In [16]:
## This dataset contains information on emails, each labeled as spam or not spam
## Each feature counts how many times a given word appears in the email

print('SHAPE: ', email.shape)
email.head()

In [17]:
## This dataset contains information on human health
## The goal is to predict the probability of heart disease based on your other health factors

print('SHAPE: ', heart.shape)
heart.head()

In [18]:
## Examine the different features and data-types

iris.info()

In [19]:
## Examining Numerical Features

iris.describe().T

In [20]:
## Accessing Features - 1
## One column

## TODO: YOUR CODE HERE ##

In [21]:
## Accessing Features - 2
## Multiple columns

## TODO: YOUR CODE HERE ##

In [22]:
## Accessing Features - 3
## Using 'loc'

## TODO: YOUR CODE HERE ##

In [23]:
## Accessing Features - 4
## Using 'iloc'

## TODO: YOUR CODE HERE ##

In [24]:
## Visualize Features - 1

def plot(col):
    """
    Plot the histogram of a single column
    """
    ## TODO: YOUR CODE HERE ##

def plot_multiple_hist(frame, col1, col2):
    """
    Plot histograms of two columns side by side
    """
    fig, ax = plt.subplots(1, 2, figsize=(10,4))
    sns.histplot(data=frame, x=col1, ax=ax[0])
    sns.histplot(data=frame, x=col2, ax=ax[1]);

plot_multiple_hist(iris, 'sepal_length', 'petal_length')

In [25]:
## Visualize Features - 2
## Correlation & Heatmap

# Correlation is computed on the features only; 'species' stays in iris for the target plot
features = iris.drop(columns='species')

corr = features.corr(method='spearman')
plt.figure(figsize=(10,10))
sns.heatmap(corr,vmax=.8,linewidths=0.01,square=True,annot=True,cmap='RdBu',linecolor='black')

In [26]:
# Visualize Target

sns.countplot(data=iris, x='species')

## Conclusion

**Question 1**

Finish the following function

In [None]:
def multiply_by_three(x):
  """
  This function multiplies a number by three

  Args:
    -- 'x': int

  Returns:
    -- int
  """

**Question 2**

Create a random array with shape (3, 5)

In [None]:
# array = 

**Question 3**

Isolate a feature from a dataframe

In [None]:
# feature = 

**Question 4**

Plot a histogram of a feature

In [None]:
def plot(dataframe, column):
  """
  This function plots a histogram of the specified dataset & feature

  Args:
    -- 'dataframe': pd.DataFrame
    -- 'column': string of feature name

  Returns:
    -- matplotlib histogram plot
  """

**Question 5 - Bonus!**

Define a function to return all rows where a specified feature is less than or equal to 4.

The function has two args - the dataframe, and the column.

The function returns the dataframe where that feature is less than or equal to 4.

*Hint: We did this in the above 'EDA' section (accessing features)* 

In [None]:
## TODO: YOUR CODE HERE ##