# Outline of data analysis process

- Importing/Exporting data
- Exploring and cleaning data
- Reshaping (aka munging, manipulating data)
- Plotting (exploratory) 

# Advanced analysis steps follow
- Inference testing
- NLP
- ML
- SNL
- ... etc

## before that:
- need to talk about indexing and slicing dataframes
- need to talk about positional and label indexing (loc vs iloc)
- filtering and projection
- joining (inner vs outer) 
- aggregations and transformations (importance of the unit of analysis)
- sorting data
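
A minimal sketch of the topics listed above, in order. The `cars` and `regions` tables and their column names are hypothetical, invented only to keep the example small.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
cars = pd.DataFrame(
    {
        "name": ["pinto", "maverick", "lecar", "corolla"],
        "origin": ["USA", "USA", "Europe", "Japan"],
        "mpg": [25.0, 21.0, 40.9, 32.0],
    },
    index=[10, 20, 30, 40],  # non-default labels so loc and iloc differ
)
regions = pd.DataFrame(
    {"origin": ["USA", "Europe", "Korea"], "region_code": ["US", "EU", "KR"]}
)

# Label vs positional indexing: the same cell reached two ways.
by_label = cars.loc[20, "name"]    # row whose index label is 20
by_position = cars.iloc[1, 0]      # second row, first column

# Filtering selects rows; projection selects columns.
efficient = cars[cars["mpg"] > 24]
names_only = cars[["name", "mpg"]]

# Inner join keeps only matching keys; outer join keeps keys from both sides.
inner = cars.merge(regions, on="origin", how="inner")
outer = cars.merge(regions, on="origin", how="outer")

# Aggregation changes the unit of analysis: one row per origin, not per car.
mean_mpg = cars.groupby("origin")["mpg"].mean()

# Transformation keeps one row per car but attaches a group-level value.
cars["origin_mean_mpg"] = cars.groupby("origin")["mpg"].transform("mean")

# Sorting by a column, highest mpg first.
ranked = cars.sort_values("mpg", ascending=False)
```

Comparing `len(inner)` with `len(outer)` is a quick way to see how many rows a join dropped or left unmatched.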

then (1 week)
- cleaning data and exploring it
- duplicated data
- types of data
- missing data, what to do with it?
- storing the data after cleaning it
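
The cleaning steps above might look roughly like this. The `raw` table is hypothetical, and an in-memory buffer stands in for a real output file.

```python
import io
import pandas as pd

# Hypothetical raw data, invented for illustration: one exact duplicate row,
# numbers stored as text, and one missing value.
raw = pd.DataFrame(
    {
        "location": ["A", "A", "B", "C"],
        "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"],
        "wind": ["3.5", "3.5", None, "7.0"],
    }
)

# Duplicated data: count exact repeats, then drop them.
n_duplicates = int(raw.duplicated().sum())
clean = raw.drop_duplicates().reset_index(drop=True)

# Types of data: convert text columns to proper dtypes.
clean["date"] = pd.to_datetime(clean["date"])
clean["wind"] = pd.to_numeric(clean["wind"])

# Missing data: count it first, then decide whether to drop or fill.
n_missing = int(clean["wind"].isna().sum())

# Storing: write the cleaned table out (a buffer stands in for a file path).
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```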

then (2 weeks or 1)
- plotting data

- need a case study 
- need a project like the case study

IMPORTANT: LEAVE DATA IMPORTING AND EXPORTING TO THE END, after the midterm project! Teach them the basics first.

# Importing/Exporting Data

- outline:
- overview of types: binary vs text
- Descriptions, benefits, what can we represent?
- importing and exporting in python
- What if data is not organized as rows and columns?
    - you organize it using lists and dictionaries, then convert to dataframe
- external data sources
- Ways to get data: data files, APIs (usually more up to date; useful for building apps, not only for analysis), scraping, manual data collection, or outsourcing
- Introduction to APIs (Twitter API, GitHub, Stack Overflow, and a weather API)
    - learn how to read the documentation
    - RESTful APIs (likely use this)
    - newer GraphQL APIs
    - learn about curl and browser exploration
    - response codes, pagination, and authentication
- Introduction to scraping using BeautifulSoup (websites with no API, e.g., local newspapers, blogs, etc.)
- end result is a data frame or dataframes
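
A rough sketch of the path from a paginated API to a dataframe, as outlined above. Everything here is hypothetical: `FAKE_PAGES` and `fetch_page` stand in for a real service so the example runs offline, and a real helper would send an authenticated HTTP request instead.

```python
import pandas as pd

# Hypothetical stand-in for a paginated REST API: each "response" carries a
# status code, a list of nested records, and the number of the next page.
FAKE_PAGES = {
    1: {"status": 200,
        "results": [{"city": "Riyadh", "readings": {"temp_c": 41}}],
        "next": 2},
    2: {"status": 200,
        "results": [{"city": "Jeddah", "readings": {"temp_c": 36}}],
        "next": None},
}

def fetch_page(page):
    """Hypothetical helper; a real one would call the API over HTTP."""
    return FAKE_PAGES[page]

rows = []
page = 1
while page is not None:
    response = fetch_page(page)
    if response["status"] != 200:  # stop on any non-success response code
        break
    for item in response["results"]:
        # Flatten each nested record into one dictionary per row.
        rows.append({"city": item["city"], "temp_c": item["readings"]["temp_c"]})
    page = response["next"]        # pagination: follow the pointer to the next page

# A list of flat dictionaries converts directly to a DataFrame.
weather = pd.DataFrame(rows)
```

For nested records like these, `pd.json_normalize` is an alternative that flattens automatically and produces dotted column names such as `readings.temp_c`.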

Then it's cleaning and joining

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("weather.csv")

In [55]:
df.append(df.iloc[1], verify_integrity=True)


ValueError: Indexes have overlapping values: [1]
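
`DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so on current versions the cell above needs `pd.concat`. A minimal sketch of the same check, using a hypothetical three-row table in place of `weather.csv`:

```python
import pandas as pd

# Hypothetical stand-in for weather.csv, invented for illustration.
df = pd.DataFrame({"location": ["A", "B", "C"], "wind": [3.5, 5.0, 7.0]})

# Double brackets keep the selected row as a one-row DataFrame.
row = df.iloc[[1]]

try:
    pd.concat([df, row], verify_integrity=True)
    raised = False
except ValueError:
    # Index label 1 would appear twice, so pandas refuses to concatenate.
    raised = True

# ignore_index=True renumbers the rows, so the same append succeeds.
appended = pd.concat([df, row], ignore_index=True)
```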

In [93]:
df2 = df.set_index(['location','date'])
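
A short sketch of what the two-level index allows. The sample rows are hypothetical; only the column names `location` and `date` come from the cell above, and `wind` from the plotting cell.

```python
import pandas as pd

# Hypothetical sample rows, invented for illustration.
df = pd.DataFrame(
    {
        "location": ["A", "A", "B"],
        "date": ["2020-01-01", "2020-01-02", "2020-01-01"],
        "wind": [3.5, 4.0, 7.0],
    }
)
df2 = df.set_index(["location", "date"])

# A full key (tuple) selects one row; a partial key selects a sub-table.
one_value = df2.loc[("A", "2020-01-02"), "wind"]
all_a = df2.loc["A"]

# reset_index() turns the index levels back into ordinary columns.
restored = df2.reset_index()
```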

In [90]:
%matplotlib inline

In [115]:
df.plot(kind="scatter",y="wind")

ValueError: scatter requires and x and y column
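
The error above arises because a scatter plot needs both axes named. A minimal sketch of a working call, assuming a hypothetical numeric `temp` column to plot against `wind`:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs outside a notebook
import pandas as pd

# Hypothetical sample data; only the "wind" column name comes from the notes.
df = pd.DataFrame({"temp": [20.0, 25.0, 30.0], "wind": [3.5, 5.0, 7.0]})

# Pass both x and y; pandas returns the matplotlib Axes it drew on.
ax = df.plot(kind="scatter", x="temp", y="wind")
```

In a notebook with `%matplotlib inline`, the backend line is unnecessary.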

In [111]:
df.date=pd.to_datetime(df.date)

In [116]:
df.duplicated?

In [117]:
df = pd.read_json("https://github.com/vega/vega-datasets/raw/gh-pages/data/cars.json")

In [130]:
df.dtypes

Acceleration        float64
Cylinders             int64
Displacement        float64
Horsepower          float64
Miles_per_Gallon    float64
Name                 object
Origin               object
Weight_in_lbs         int64
Year                 object
dtype: object
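
`Year` shows up as `object` because the dates were read as text. One possible fix, sketched on a hypothetical three-row sample that mimics the column:

```python
import pandas as pd

# Hypothetical sample mimicking the Year column of the cars data.
cars = pd.DataFrame(
    {"Name": ["a", "b", "c"], "Year": ["1970-01-01", "1971-01-01", "1982-01-01"]}
)

# Parse the text into real timestamps.
cars["Year"] = pd.to_datetime(cars["Year"])

# The .dt accessor then exposes date parts, such as the calendar year.
model_year = cars["Year"].dt.year
```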

In [129]:
df[df.Horsepower.isnull()]

Unnamed: 0,Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year
38,19.0,4,98,,25.0,ford pinto,USA,2046,1971-01-01
133,17.0,6,200,,21.0,ford maverick,USA,2875,1974-01-01
337,17.3,4,85,,40.9,renault lecar deluxe,Europe,1835,1980-01-01
343,14.3,4,140,,23.6,ford mustang cobra,USA,2905,1980-01-01
361,15.8,4,100,,34.5,renault 18i,Europe,2320,1982-01-01
382,20.5,4,151,,23.0,amc concord dl,USA,3035,1982-01-01
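
Two common ways to handle rows like these, sketched on hypothetical data. Filling with the median of cars that share a cylinder count is one reasonable assumption, not the only option.

```python
import pandas as pd

# Hypothetical sample with gaps in Horsepower, invented for illustration.
cars = pd.DataFrame(
    {
        "Cylinders": [4, 4, 4, 6, 6],
        "Horsepower": [90.0, None, 70.0, 100.0, None],
    }
)

# Option 1: drop rows where Horsepower is missing.
dropped = cars.dropna(subset=["Horsepower"])

# Option 2: fill each gap with the median of cars with the same cylinder count.
group_median = cars.groupby("Cylinders")["Horsepower"].transform("median")
filled = cars.assign(Horsepower=cars["Horsepower"].fillna(group_median))
```

Dropping is simpler but loses rows; filling keeps the rows at the cost of inventing plausible values.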


In [131]:
df.head()

Unnamed: 0,Acceleration,Cylinders,Displacement,Horsepower,Miles_per_Gallon,Name,Origin,Weight_in_lbs,Year
0,12.0,8,307,130,18,chevrolet chevelle malibu,USA,3504,1970-01-01
1,11.5,8,350,165,15,buick skylark 320,USA,3693,1970-01-01
2,11.0,8,318,150,18,plymouth satellite,USA,3436,1970-01-01
3,12.0,8,304,150,16,amc rebel sst,USA,3433,1970-01-01
4,10.5,8,302,140,17,ford torino,USA,3449,1970-01-01


In [139]:
df[(df.Miles_per_Gallon > df.Miles_per_Gallon.mean()) & (df.Origin == "USA")].Cylinders.value_counts()

4    60
6     5
8     2
Name: Cylinders, dtype: int64

In [140]:
df[(df.Miles_per_Gallon < df.Miles_per_Gallon.mean()) & (df.Origin == "USA")].Cylinders.value_counts()

8    101
6     69
4     12
Name: Cylinders, dtype: int64
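
The two filtered counts above can also be produced as a single table. A sketch using `pd.crosstab` on hypothetical data with the same column names:

```python
import pandas as pd

# Hypothetical sample, invented for illustration.
cars = pd.DataFrame(
    {
        "Origin": ["USA", "USA", "USA", "Japan", "USA"],
        "Cylinders": [8, 4, 8, 4, 6],
        "Miles_per_Gallon": [15.0, 30.0, 14.0, 35.0, 20.0],
    }
)

# Keep USA cars, then flag those above the overall mean mpg.
usa = cars[cars["Origin"] == "USA"]
above_mean = usa["Miles_per_Gallon"] > cars["Miles_per_Gallon"].mean()

# One row per cylinder count, one column per side of the mean.
table = pd.crosstab(usa["Cylinders"], above_mean.rename("above_mean_mpg"))
```

Unlike the separate `<` and `>` filters, this puts cars with missing or exactly average mpg in the `False` column, so the totals can differ slightly.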