<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#EDA-Intro" data-toc-modified-id="EDA-Intro-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>EDA Intro</a></span><ul class="toc-item"><li><span><a href="#Assignment" data-toc-modified-id="Assignment-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Assignment</a></span></li></ul></li><li><span><a href="#Model-Cleaning-1:-Variable-type" data-toc-modified-id="Model-Cleaning-1:-Variable-type-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Model Cleaning 1: Variable type</a></span></li><li><span><a href="#Assignment" data-toc-modified-id="Assignment-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Assignment</a></span></li><li><span><a href="#Data-Cleaning-2:-Missing-Values" data-toc-modified-id="Data-Cleaning-2:-Missing-Values-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Data Cleaning 2: Missing Values</a></span><ul class="toc-item"><li><span><a href="#Assignment" data-toc-modified-id="Assignment-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Assignment</a></span></li></ul></li></ul></div>

# Imports

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import warnings
import psycopg2

ModuleNotFoundError: No module named 'psycopg2'

In [2]:
!pip install psycopg2

Collecting psycopg2
  Downloading https://files.pythonhosted.org/packages/5c/1c/6997288da181277a0c29bc39a5f9143ff20b8c99f2a7d059cfb55163e165/psycopg2-2.8.3.tar.gz (377kB)
Building wheels for collected packages: psycopg2
  Building wheel for psycopg2 (setup.py) ... done
  Stored in directory: /Users/marshallmamiya/Library/Caches/pip/wheels/48/06/67/475967017d99b988421b87bf7ee5fad0dad789dc349561786b
Successfully built psycopg2
Installing collected packages: psycopg2
Successfully installed psycopg2-2.8.3
You are using pip version 19.0.3, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


# EDA Intro

EDA (Exploratory Data Analysis) is the first step in a data science pipeline. It involves three main components: cleaning, exploring, and feature engineering.

Clean: Fix problems in the initial dataset (changing datatypes, removing NaNs, string replacements).

Explore: Create visualizations and statistics to show relationships between features.

Feature Engineering: Choose the most relevant features and create new features based upon them.

## Assignment

1. What is the goal of EDA (exploratory data analysis)?
    -  Identify and prepare features to be used for modeling.
    
    
2. Suppose that you are given a dataset of customer product reviews for an e-commerce company. Each review is scored as a Likert-style survey item where 1 indicates a negative sentiment about the product and a 5 is positive. These reviews are collected on the company's website. 

    a. What problems do you expect to find in the raw data? 
        
        - Answers may fall outside the rating scale, arrive as strings or floats, or be written out as words (e.g. "three"). There may also be trailing whitespace or stray special characters.
    
    b. If your task is to build features that give information about customer sentiments, how would you approach this task and what kind of methods would you apply to accomplish it? 

        - Each product or review feature could have an average rating column. If the e-commerce company sells multiple categories of products, averaging the ratings within each category could be useful.
    
    c. Try to identify some potentially useful features that you might derive from the raw data. How would you derive them and how would you assess the usefulness of those features?
        
        - It depends on the purpose of the model. If the model is used for predicting a product's rating, then each product could be assigned to a category (e.g. fashion, gadgets, etc.) and the categories of interest used as training features.
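
The raw-data problems listed in 2a can be handled with a small normalization step. The sketch below is one possible approach, not a definitive implementation: the sample values, the `words` mapping, and the `clean_rating` helper are all hypothetical and chosen only to illustrate the cleaning rules.

```python
import pandas as pd

# Hypothetical raw review scores showing the problems listed in 2a.
raw = pd.Series(["5", " 4 ", "three", "7", "2.0", "4!", None])

# Assumed vocabulary for ratings that are spelled out as words.
words = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def clean_rating(value):
    """Return an integer rating in 1..5, or pd.NA if the value is unusable."""
    if pd.isna(value):
        return pd.NA
    text = str(value).strip().lower()
    text = words.get(text, text)
    # Drop anything that is not a digit or a decimal point.
    text = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    try:
        number = float(text)
    except ValueError:
        return pd.NA
    # Reject non-integer or out-of-scale scores.
    if number != int(number) or not 1 <= number <= 5:
        return pd.NA
    return int(number)

ratings = raw.map(clean_rating).astype("Int64")
print(ratings.tolist())
```

Out-of-scale answers such as `"7"` become missing values here rather than being clipped to 5, so they can be handled later together with the other missing values.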

# Model Cleaning 1: Variable type

Zero point: Where measurement starts (ratio), or the point where positive and negative values separate (interval).

Continuous datatypes:
    - Interval: Distances between points are standardized and evenly scaled. There is no absolute zero point, which means values can't be meaningfully multiplied or divided (e.g. temperature in Celsius or Fahrenheit).
    - Ratio: There is an absolute zero point, so ratios between values are meaningful. This typically means no negative data.
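
A short numeric illustration of why division is not meaningful on an interval scale. The temperatures are made-up values; the only fact relied on is the Celsius-to-Kelvin offset of 273.15.

```python
# Temperatures in Celsius (interval scale: the zero point is arbitrary).
morning_c, noon_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
print(noon_c - morning_c)

# Ratios are not: the same temperatures on the Kelvin scale
# (ratio scale, absolute zero) give a very different quotient.
morning_k, noon_k = morning_c + 273.15, noon_c + 273.15
print(noon_c / morning_c)  # 2.0, yet noon is not "twice as hot"
print(noon_k / morning_k)  # roughly 1.035
```

The difference of 10 degrees is the same on both scales, while the quotient changes with the choice of zero point.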

# Assignment

1. Consider the advantages and disadvantages of treating the Rank variable as categorical. Discuss your arguments with your mentor.
    
    - Treating Rank as categorical, each channel can be grouped into top-ranking, mid-ranking, and low-ranking channels. For data exploration, the groups can be visualized separately to see the contrast between features. If the data is grouped by another feature, then the numerical ranks should be used instead, since they can feed into statistical analysis.
    
    
2. What are the types of the following variables?

    * Age: Continuous ratio

    * Salary: Continuous ratio

    * Revenue: Continuous ratio

    * Customer type: Nominal categorical

    * Stock price: Continuous ratio
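
The grouping described in the answer to question 1 could be sketched with `pd.cut`. The channel names, rank values, and bin edges below are illustrative assumptions and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical channel rankings (1 = best).
channels = pd.DataFrame({
    "channel": ["A", "B", "C", "D", "E", "F"],
    "rank": [1, 40, 120, 900, 2500, 4800],
})

# Bin edges are assumptions chosen only for this example.
channels["rank_group"] = pd.cut(
    channels["rank"],
    bins=[0, 100, 1000, 5000],
    labels=["top", "mid", "low"],
)

# Summarize each group; the numeric rank is still available for statistics.
summary = channels.groupby("rank_group", observed=True)["rank"].agg(["count", "mean"])
print(summary)
```

Keeping both the numeric column and the derived categorical column allows group-wise visualization without giving up numerical analysis.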

# Data Cleaning 2: Missing Values

## Assignment

Approach to missing values:
    - Delete rows with NaNs: Use as a last resort, and only if the dataset is large enough that dropping those rows wouldn't have a significant effect on the data.

    - Imputation: Fill NaNs with a measure of central tendency (mean, median, or mode).
    - Interpolation: Estimate a missing value from neighboring rows. Commonly used with time series data, but it can also be used with categorical data. The data must be meaningfully ordered for the filled-in numerical or categorical estimate to be accurate.
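
The three approaches above map onto standard pandas calls. This is a minimal sketch on a made-up frame; the `year` and `value` columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly measurements with gaps.
df = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003, 2004],
    "value": [10.0, np.nan, 14.0, np.nan, 18.0],
})

# Deletion: drop rows where the value is missing.
dropped = df.dropna(subset=["value"])

# Imputation: fill gaps with the column mean.
imputed = df["value"].fillna(df["value"].mean())

# Interpolation: order by year, then fill linearly between neighbors.
interpolated = df.sort_values("year")["value"].interpolate()

print(len(dropped))
print(imputed.tolist())
print(interpolated.tolist())
```

On this trending series, mean imputation fills both gaps with the same value, while interpolation follows the trend between neighboring years.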


1. Determine all the variable types and find the fraction of the missing values for each variable.

2. Notice that the data has a time dimension (year). For this assignment, forget about time and treat all the observations as if they're from the same year. Choose a strategy to deal with the missing values for each variable. For which variables would filling in the missing values with some value make sense? For which might tossing out the records entirely make sense?

3. Now, take into account the time factor. Replicate your second answer but this time fill in the missing values by using a statistic that is calculated within the year of the observation. For example, if you want to fill a missing value for a variable with the mean of that variable, calculate the mean by using *only* the observations for that specific year.

4. This time, fill in the missing values using interpolation (extrapolation).

5. Compare your results for the 2nd, 3rd, and 4th questions. Do you find any meaningful differences?
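
Questions 1 to 3 could be approached along the following lines. This sketch uses a small made-up frame in place of the `useducation` table; the column names `year` and `enroll` are assumptions and may not match the real schema.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the education table.
edu = pd.DataFrame({
    "year": [1992, 1992, 1992, 1993, 1993, 1993],
    "enroll": [100.0, np.nan, 300.0, 400.0, 600.0, np.nan],
})

# Question 1: variable types and fraction of missing values per column.
print(edu.dtypes)
missing_fraction = edu.isnull().mean()
print(missing_fraction)

# Question 2: fill with the overall mean, ignoring the year.
overall = edu["enroll"].fillna(edu["enroll"].mean())

# Question 3: fill with the mean computed within the same year only.
year_mean = edu.groupby("year")["enroll"].transform("mean")
by_year = edu["enroll"].fillna(year_mean)

print(overall.tolist())
print(by_year.tolist())
```

`transform("mean")` returns a series aligned with the original rows, which is what lets `fillna` pick the matching year's mean for each missing entry.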

In [4]:
postgres_user = 'dsbc_student'
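# The password contains a literal asterisk. The Markdown-escaped form '7\*.8G9QH21'
# adds a stray backslash to the string, which makes authentication fail.
postgres_pw = '7*.8G9QH21'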
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'useducation'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

edu_df = pd.read_sql_query('select * from useducation',con=engine)

# no need for an open connection, 
# as we're only doing a single query
engine.dispose()

OperationalError: (psycopg2.OperationalError) FATAL:  password authentication failed for user "dsbc_student"
FATAL:  password authentication failed for user "dsbc_student"
 (Background on this error at: http://sqlalche.me/e/e3q8)