# Pandas Data Analysis Workshop
(ft. Kyle Sorensen)

Now more than ever, data can describe our physical, social, and technological world. Thanks to recent (and accelerating) advances in computing, anyone with the motivation can use data to effect change in nearly any field, from those closely tied to computer science, such as machine learning, to those with less obvious connections, such as biology and economics!

You have likely heard a lot of talk recently about *data science*, but what exactly is it? A quick search gives this definition: "an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge from data across a broad range of application domains". That is a mouthful, but essentially, data science draws on computer science, statistics, and mathematics to <b>make data understandable and actionable</b>. This is the key: what is data worth if we cannot interpret it and make decisions based on it?

## `pandas` and Data Science in Python:
The industry standard for this type of work is the `pandas` library, along with `numpy`, a dependency of `pandas` that will come in handy, and `seaborn`, a lovely data visualization tool. Other Python libraries also work well for data science, and we may explore them later, but these are all you will need for now! :)

Below, you will see the usual convention for importing `pandas`, `numpy`, and `seaborn`:

In [26]:
import pandas as pd
import numpy as np
import seaborn as sns

## `pandas` Basics
Before we get our hands on some real data, let's go over the basics of `pandas` including...

* The data structures of `pandas`
* Viewing data and metadata
* Importing and exporting data
* Indexing and selecting data
* Merge, join, concatenate and compare
* Group by and summarization
* Reshaping data
* Time series functionality

In [30]:
# There are two primary data structures in pandas -
# the `DataFrame` and the `Series`
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
# note: the conceptual structure of a DataFrame 
# can be thought of as a dictionary or hashmap
# with string keys and list values
s = pd.Series([1, 2, 3])
print(s)
s = df.A
print(s)
# note: each column in a DataFrame is a labeled
# Series object; this is the link between the two
# basic data structures in pandas

0    1
1    2
2    3
dtype: int64
0    1
1    2
2    3
Name: A, dtype: int64


In [31]:
# There are three methods of interest here:
print(df.head(2)) # head(n) displays the FIRST n rows of data
                    # from our DataFrame object
print(df.tail(2)) # tail(n) displays the LAST n rows of data
                    # from our DataFrame object
df.info()         # prints metadata for each column in our
                    # DataFrame object, including non-null counts;
                    # it returns None, so we do not wrap it in print()

   A  B  C
0  1  4  7
1  2  5  8
   A  B  C
1  2  5  8
2  3  6  9
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
 2   C       3 non-null      int64
dtypes: int64(3)
memory usage: 200.0 bytes


In [None]:
# Import

# Export

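The import/export cell above was left empty, so here is a minimal sketch of one way to fill it in. It assumes CSV as the file format and round-trips the small `df` from earlier through an in-memory buffer, so it runs without touching the disk. The names `csv_text` and `df_roundtrip` are introduced here only for illustration.

```python
import io

import pandas as pd

# Same toy DataFrame used earlier in the workshop
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

# Export: with no path argument, to_csv returns the CSV as a string.
# index=False leaves the row labels out of the file.
csv_text = df.to_csv(index=False)
print(csv_text)

# Import: read_csv accepts a file path or any file-like object,
# so we wrap the string in StringIO to read it back.
df_roundtrip = pd.read_csv(io.StringIO(csv_text))
print(df_roundtrip.equals(df))
```

To work with a real file instead, pass a path to both calls, for example `df.to_csv('example.csv', index=False)` followed by `pd.read_csv('example.csv')`; the file name `example.csv` is hypothetical.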

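The list of basics also mentions indexing, selecting, and group-by summarization. The following is a small sketch of those ideas on the same toy `df`, not a complete tour. The `parity` column is a hypothetical grouping key added only to have something to group on.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

# Selecting a column by name gives a Series
col_b = df['B']

# iloc selects by integer position, loc selects by label
first_row = df.iloc[0]     # first row, as a Series
cell = df.loc[2, 'C']      # row labeled 2, column 'C'

# A boolean mask keeps only the rows where the condition holds
big_a = df[df['A'] > 1]

# Group by a (hypothetical) key and summarize each group
df['parity'] = df['A'].map(lambda a: 'even' if a % 2 == 0 else 'odd')
sums = df.groupby('parity')['C'].sum()

print(cell)
print(big_a)
print(sums)
```

Here `loc` and `iloc` happen to agree because the default index is 0, 1, 2; they differ as soon as the row labels are not consecutive integers starting at 0.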
## Our Dataset (source: [rashida048](https://github.com/rashida048/Datasets/blob/master/home_data.csv))
The dataset we will be using for this workshop contains pricing data on 21,613 homes, with variables such as `yr_renovated`, the year of the most recent renovation if available and 0 otherwise, and `waterfront`, an indicator variable for whether the property is located close to a body of water. We use `pandas` here rather than something more user-friendly like MS Excel because our data is 3.5+ MB, making it quite cumbersome to work with in a spreadsheet.

To get started, we will load our data and use the `head(n)` and `info()` methods to display the first 10 rows of data along with a summary of the columns, including data types.

In [27]:
home_data = pd.read_csv('home_data.csv')
home_data.head(10)

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
| 5 | 7237550310 | 20140512T000000 | 1225000 | 4 | 4.5 | 5420 | 101930 | 1.0 | 0 | 0 | ... | 11 | 3890 | 1530 | 2001 | 0 | 98053 | 47.6561 | -122.005 | 4760 | 101930 |
| 6 | 1321400060 | 20140627T000000 | 257500 | 3 | 2.25 | 1715 | 6819 | 2.0 | 0 | 0 | ... | 7 | 1715 | 0 | 1995 | 0 | 98003 | 47.3097 | -122.327 | 2238 | 6819 |
| 7 | 2008000270 | 20150115T000000 | 291850 | 3 | 1.5 | 1060 | 9711 | 1.0 | 0 | 0 | ... | 7 | 1060 | 0 | 1963 | 0 | 98198 | 47.4095 | -122.315 | 1650 | 9711 |
| 8 | 2414600126 | 20150415T000000 | 229500 | 3 | 1.0 | 1780 | 7470 | 1.0 | 0 | 0 | ... | 7 | 1050 | 730 | 1960 | 0 | 98146 | 47.5123 | -122.337 | 1780 | 8113 |
| 9 | 3793500160 | 20150312T000000 | 323000 | 3 | 2.5 | 1890 | 6560 | 2.0 | 0 | 0 | ... | 7 | 1890 | 0 | 2003 | 0 | 98038 | 47.3684 | -122.031 | 2390 | 7570 |

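The `date` values in the first rows look like compact timestamps stored as text, and `yr_renovated` uses 0 to mean "no renovation recorded". Here is a hedged sketch of how both could be cleaned up, using three rows copied from the output above so that it runs without the CSV file. The format string `'%Y%m%dT%H%M%S'` is an assumption inferred from the displayed values, and `sample` is a name introduced only for this example.

```python
import pandas as pd

# Three rows copied from the first rows of home_data shown above
sample = pd.DataFrame({
    'date': ['20141013T000000', '20141209T000000', '20150225T000000'],
    'price': [221900, 538000, 180000],
    'yr_renovated': [0, 1991, 0],
})

# Parse the text timestamps into real datetime values
# (format string assumed from the values on display)
sample['date'] = pd.to_datetime(sample['date'], format='%Y%m%dT%H%M%S')

# Treat the 0 placeholder as missing; mask() replaces matching
# entries with NaN, which upcasts the column to float64
sample['yr_renovated'] = sample['yr_renovated'].mask(sample['yr_renovated'] == 0)

print(sample.dtypes)
print(sample)
```

Marking the zeros as missing keeps them from skewing summaries such as the average renovation year, at the cost of the column no longer being an integer type.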

In [28]:
home_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  int64  
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

## Tasks for This Workshop
Using the skills we learned above along with some additional functionality from other libraries, we will complete the following tasks...
* Generate a pivot table displaying average home price by number of bathrooms and number of bedrooms
* Construct a time series model for housing prices with a 24-month forecast
* Construct a heatmap of housing prices using latitude and longitude coordinates
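
As a preview of the first task, the following sketch builds a pivot table of average price by bedrooms and bathrooms. It uses six rows copied from the first rows of `home_data` shown earlier rather than the full file, so the numbers are illustrative only; `sample` and `pivot` are names introduced here, and `pivot_table` is one reasonable way to do this, not necessarily the approach the workshop takes.

```python
import pandas as pd

# Six rows copied from the first rows of home_data
# (row labels 0, 1, 2, 3, 4 and 8)
sample = pd.DataFrame({
    'price': [221900, 538000, 180000, 604000, 510000, 229500],
    'bedrooms': [3, 3, 2, 4, 3, 3],
    'bathrooms': [1.0, 2.25, 1.0, 3.0, 2.0, 1.0],
})

# Rows are bedroom counts, columns are bathroom counts,
# and each cell holds the mean price for that combination
pivot = sample.pivot_table(values='price',
                           index='bedrooms',
                           columns='bathrooms',
                           aggfunc='mean')
print(pivot)
```

Combinations that never occur in the data, such as 2 bedrooms with 3 bathrooms in this sample, show up as `NaN` in the result.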