# Phase 4 Project Notebook
- Author: Jonathan Holt
- Data Science Flex

## Business Problem
- What are the 5 best zip codes for us to invest in?

## Questions to Answer

## What Models & Metrics I plan on using

## Helper Functions
- Functions provided by Flatiron

In [1]:
def get_datetimes(df):
    """
    Takes a dataframe:
    returns only those column names that can be converted into datetime objects 
    as datetime objects.
    NOTE number of returned columns may not match total number of columns in passed dataframe
    """
    
    return pd.to_datetime(df.columns.values[1:], format='%Y-%m')
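
The docstring above promises only the column names that can be converted, while the body simply skips the first column and converts the rest. A minimal sketch of a variant that matches the docstring more closely is shown here; `get_datetime_columns` and the toy frame are hypothetical names used for illustration, not part of the Flatiron starter code.

```python
import pandas as pd


def get_datetime_columns(df, fmt="%Y-%m"):
    """Hypothetical variant: return a DatetimeIndex built from the column
    labels that parse with `fmt`; labels that do not parse are dropped."""
    labels = pd.Series(df.columns.astype(str))
    # errors="coerce" turns labels such as 'RegionID' into NaT instead of raising
    parsed = pd.to_datetime(labels, format=fmt, errors="coerce")
    return pd.DatetimeIndex(parsed.dropna())


# Toy frame (assumption): two identifier columns followed by two monthly columns
toy = pd.DataFrame(columns=["RegionID", "City", "1996-04", "1996-05"])
date_labels = get_datetime_columns(toy)
print(date_labels)
```

Because the filter is based on parsing rather than position, the same call works whether or not the identifier columns have already been dropped.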

In [2]:
def melt_data(df):
    """
    Takes the zillow_data dataset in wide form or a subset of the zillow_dataset.  
    Returns a long-form datetime dataframe 
    with the datetime column names as the index and the values as the 'value' column.
    
    If more than one row is passed in the wide-form dataset, the 'value' column
    will be the mean of the values from the datetime columns in all of the rows.
    """
    
    melted = pd.melt(df, id_vars=['RegionName', 'RegionID', 'SizeRank', 'City', 'State', 'Metro', 'CountyName'], var_name='time')
    # infer_datetime_format is deprecated in pandas 2.x; the default parser handles 'YYYY-MM' labels
    melted['time'] = pd.to_datetime(melted['time'])
    melted = melted.dropna(subset=['value'])
    return melted.groupby('time').aggregate({'value':'mean'})

## Loading Data

In [93]:
import pandas as pd

pd.set_option('display.max_rows', 1000)  # show up to 1000 rows when displaying a DataFrame

In [4]:
df = pd.read_csv("zillow_data.csv")

In [5]:
df.head()

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,...,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04
0,84654,60657,Chicago,IL,Chicago,Cook,1,334200.0,335400.0,336500.0,...,1005500,1007500,1007800,1009600,1013300,1018700,1024400,1030700,1033800,1030600
1,90668,75070,McKinney,TX,Dallas-Fort Worth,Collin,2,235700.0,236900.0,236700.0,...,308000,310000,312500,314100,315000,316600,318100,319600,321100,321800
2,91982,77494,Katy,TX,Houston,Harris,3,210400.0,212200.0,212200.0,...,321000,320600,320200,320400,320800,321200,321200,323000,326900,329900
3,84616,60614,Chicago,IL,Chicago,Cook,4,498100.0,500900.0,503100.0,...,1289800,1287700,1287400,1291500,1296600,1299000,1302700,1306400,1308500,1307000
4,93144,79936,El Paso,TX,El Paso,El Paso,5,77300.0,77300.0,77300.0,...,119100,119400,120000,120300,120300,120300,120300,120500,121000,121500


In [26]:
df.describe()

Unnamed: 0,RegionID,RegionName,SizeRank,1996-04,1996-05,1996-06,1996-07,1996-08,1996-09,1996-10,...,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04
count,14723.0,14723.0,14723.0,13684.0,13684.0,13684.0,13684.0,13684.0,13684.0,13684.0,...,14723.0,14723.0,14723.0,14723.0,14723.0,14723.0,14723.0,14723.0,14723.0,14723.0
mean,81075.010052,48222.348706,7362.0,118299.1,118419.0,118537.4,118653.1,118780.3,118927.5,119120.5,...,273335.4,274865.8,276464.6,278033.2,279520.9,281095.3,282657.1,284368.7,286511.4,288039.9
std,31934.118525,29359.325439,4250.308342,86002.51,86155.67,86309.23,86467.95,86650.94,86872.08,87151.85,...,360398.4,361467.8,362756.3,364461.0,365600.3,367045.4,369572.7,371773.9,372461.2,372054.4
min,58196.0,1001.0,1.0,11300.0,11500.0,11600.0,11800.0,11800.0,12000.0,12100.0,...,14400.0,14500.0,14700.0,14800.0,14500.0,14300.0,14100.0,13900.0,13800.0,13800.0
25%,67174.5,22101.5,3681.5,68800.0,68900.0,69100.0,69200.0,69375.0,69500.0,69600.0,...,126900.0,127500.0,128200.0,128700.0,129250.0,129900.0,130600.0,131050.0,131950.0,132400.0
50%,78007.0,46106.0,7362.0,99500.0,99500.0,99700.0,99700.0,99800.0,99900.0,99950.0,...,188400.0,189600.0,190500.0,191400.0,192500.0,193400.0,194100.0,195000.0,196700.0,198100.0
75%,90920.5,75205.5,11042.5,143200.0,143300.0,143225.0,143225.0,143500.0,143700.0,143900.0,...,305000.0,306650.0,308500.0,309800.0,311700.0,313400.0,315100.0,316850.0,318850.0,321100.0
max,753844.0,99901.0,14723.0,3676700.0,3704200.0,3729600.0,3754600.0,3781800.0,3813500.0,3849600.0,...,18889900.0,18703500.0,18605300.0,18569400.0,18428800.0,18307100.0,18365900.0,18530400.0,18337700.0,17894900.0


In [7]:
#get_datetimes(df)

## Checking for Null Values

In [22]:
df.isnull().sum()

RegionID         0
RegionName       0
City             0
State            0
Metro         1043
CountyName       0
SizeRank         0
1996-04       1039
1996-05       1039
1996-06       1039
1996-07       1039
1996-08       1039
1996-09       1039
1996-10       1039
1996-11       1039
1996-12       1039
1997-01       1039
1997-02       1039
1997-03       1039
1997-04       1039
1997-05       1039
1997-06       1039
1997-07       1038
1997-08       1038
1997-09       1038
1997-10       1038
1997-11       1038
1997-12       1038
1998-01       1036
1998-02       1036
1998-03       1036
1998-04       1036
1998-05       1036
1998-06       1036
1998-07       1036
1998-08       1036
1998-09       1036
1998-10       1036
1998-11       1036
1998-12       1036
1999-01       1036
1999-02       1036
1999-03       1036
1999-04       1036
1999-05       1036
1999-06       1036
1999-07       1036
1999-08       1036
1999-09       1036
1999-10       1036
1999-11       1036
1999-12       1036
2000-01     

### Analysis

There are many null values.
- Among the categorical columns, only Metro has nulls: about 7% (1,043 of 14,723).
- The monthly value columns from 1996 through mid-2003 are also about 7% null (1,039 of 14,723 at the start).
- After that the share of nulls drops to about 6% and keeps decreasing.

What is my decision on null values?
- keep?
- delete?
- impute (e.g., with the mean)?
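
The percentages quoted above can be computed directly instead of by hand. The sketch below assumes a small toy frame in place of `zillow_data.csv`; `null_percentages` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def null_percentages(frame):
    """Return the percentage of missing values per column, largest first."""
    # isnull().mean() is the fraction of missing entries in each column
    return (frame.isnull().mean() * 100).round(2).sort_values(ascending=False)


# Toy frame (assumption) standing in for the real dataset
toy = pd.DataFrame({
    "Metro": ["Chicago", None, "Houston", "El Paso"],
    "1996-04": [334200.0, np.nan, np.nan, 77300.0],
    "2018-04": [1030600.0, 321800.0, 329900.0, 121500.0],
})

null_pct = null_percentages(toy)
print(null_pct)
```

Running the same helper on the full `df` would show exactly where the null share starts to fall, which may help with the keep/delete/impute decision.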

## Analysis of Categories

In [8]:
cat_df = df.iloc[:,0:7]
cat_df.head()

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank
0,84654,60657,Chicago,IL,Chicago,Cook,1
1,90668,75070,McKinney,TX,Dallas-Fort Worth,Collin,2
2,91982,77494,Katy,TX,Houston,Harris,3
3,84616,60614,Chicago,IL,Chicago,Cook,4
4,93144,79936,El Paso,TX,El Paso,El Paso,5


In [54]:
cat_df = cat_df.astype('category')
cat_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14723 entries, 0 to 14722
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   RegionID    14723 non-null  category
 1   RegionName  14723 non-null  category
 2   City        14723 non-null  category
 3   State       14723 non-null  category
 4   Metro       13680 non-null  category
 5   CountyName  14723 non-null  category
 6   SizeRank    14723 non-null  category
dtypes: category(7)
memory usage: 2.8 MB


In [78]:
print("RegionID unique values:", cat_df['RegionID'].nunique())
print("RegionName unique values:", cat_df['RegionName'].nunique())
print("City unique values:", cat_df['City'].nunique())
print("State unique values:", cat_df['State'].nunique())
print("Metro unique values:", cat_df['Metro'].nunique())
print("CountyName unique values:", cat_df['CountyName'].nunique())
print("SizeRank unique values:", cat_df['SizeRank'].nunique())

RegionID unique values: 14723
RegionName unique values: 14723
City unique values: 7554
State unique values: 51
Metro unique values: 701
CountyName unique values: 1212
SizeRank unique values: 14723


In [87]:
cat_df.sort_values(by = ['RegionName'], ascending = True).head()

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank
5850,58196,1001,Agawam,MA,Springfield,Hampden,5851
4199,58197,1002,Amherst,MA,Springfield,Hampshire,4200
11213,58200,1005,Barre,MA,Worcester,Worcester,11214
6850,58201,1007,Belchertown,MA,Springfield,Hampshire,6851
14547,58202,1008,Blandford,MA,Springfield,Hampden,14548


A Google search shows that RegionName is the zip code for each region. However, upon sorting, I discovered that any zip code beginning with 0 had lost its leading zero (the column was read as an integer) and was displayed as a 4-digit number. I will use the `.str.zfill()` method to ensure that every RegionName is displayed with 5 digits.

In [90]:
cat_df['RegionName'] = cat_df['RegionName'].astype(str).str.zfill(5)

In [95]:
cat_df.sort_values(by = ['RegionName'], ascending = True).head()

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank
5850,58196,1001,Agawam,MA,Springfield,Hampden,5851
4199,58197,1002,Amherst,MA,Springfield,Hampshire,4200
11213,58200,1005,Barre,MA,Worcester,Worcester,11214
6850,58201,1007,Belchertown,MA,Springfield,Hampshire,6851
14547,58202,1008,Blandford,MA,Springfield,Hampden,14548


That seems to have fixed it!
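
A quick way to confirm the padding is to check the string lengths rather than rely on the displayed table. The sketch below uses hypothetical toy data. It also shows an alternative, reading the column as text with the `dtype` argument of `pd.read_csv`, so the leading zeros are never lost in the first place.

```python
import io

import pandas as pd

# Toy data (assumption): zip codes stored as integers have lost their leading zero
zips = pd.DataFrame({"RegionName": [1001, 1002, 60657]})
zips["RegionName"] = zips["RegionName"].astype(str).str.zfill(5)

padded = zips["RegionName"].tolist()
all_five_digits = zips["RegionName"].str.len().eq(5).all()

# Alternative: read the column as text so no padding is needed afterwards
csv_text = "RegionName,City\n01001,Agawam\n60657,Chicago\n"
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"RegionName": str})

print(padded, all_five_digits)
print(as_text["RegionName"].tolist())
```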

## Analysis of Data Values

In [116]:
df['RegionID_copy'] = df['RegionID']

In [117]:
df.head()

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,...,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04,RegionID_copy
0,84654,60657,Chicago,IL,Chicago,Cook,1,334200.0,335400.0,336500.0,...,1007500,1007800,1009600,1013300,1018700,1024400,1030700,1033800,1030600,84654
1,90668,75070,McKinney,TX,Dallas-Fort Worth,Collin,2,235700.0,236900.0,236700.0,...,310000,312500,314100,315000,316600,318100,319600,321100,321800,90668
2,91982,77494,Katy,TX,Houston,Harris,3,210400.0,212200.0,212200.0,...,320600,320200,320400,320800,321200,321200,323000,326900,329900,91982
3,84616,60614,Chicago,IL,Chicago,Cook,4,498100.0,500900.0,503100.0,...,1287700,1287400,1291500,1296600,1299000,1302700,1306400,1308500,1307000,84616
4,93144,79936,El Paso,TX,El Paso,El Paso,5,77300.0,77300.0,77300.0,...,119400,120000,120300,120300,120300,120300,120500,121000,121500,93144


In [118]:
data_df = df.iloc[:, 7:]
data_df.head()

Unnamed: 0,1996-04,1996-05,1996-06,1996-07,1996-08,1996-09,1996-10,1996-11,1996-12,1997-01,...,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04,RegionID_copy
0,334200.0,335400.0,336500.0,337600.0,338500.0,339500.0,340400.0,341300.0,342600.0,344400.0,...,1007500,1007800,1009600,1013300,1018700,1024400,1030700,1033800,1030600,84654
1,235700.0,236900.0,236700.0,235400.0,233300.0,230600.0,227300.0,223400.0,219600.0,215800.0,...,310000,312500,314100,315000,316600,318100,319600,321100,321800,90668
2,210400.0,212200.0,212200.0,210700.0,208300.0,205500.0,202500.0,199800.0,198300.0,197300.0,...,320600,320200,320400,320800,321200,321200,323000,326900,329900,91982
3,498100.0,500900.0,503100.0,504600.0,505500.0,505700.0,505300.0,504200.0,503600.0,503400.0,...,1287700,1287400,1291500,1296600,1299000,1302700,1306400,1308500,1307000,84616
4,77300.0,77300.0,77300.0,77300.0,77400.0,77500.0,77600.0,77700.0,77700.0,77800.0,...,119400,120000,120300,120300,120300,120300,120500,121000,121500,93144


In [119]:
# RegionID_copy is not a date label, so drop it before converting the column names
data_df = data_df.drop(columns='RegionID_copy')
data_df.columns = pd.to_datetime(data_df.columns, format='%Y-%m')
data_df.head()

In [99]:
# express values in thousands of dollars for ease of reading
data_df = data_df / 1000
data_df.head()

Unnamed: 0,1996-04-01,1996-05-01,1996-06-01,1996-07-01,1996-08-01,1996-09-01,1996-10-01,1996-11-01,1996-12-01,1997-01-01,...,2017-07-01,2017-08-01,2017-09-01,2017-10-01,2017-11-01,2017-12-01,2018-01-01,2018-02-01,2018-03-01,2018-04-01
0,334.2,335.4,336.5,337.6,338.5,339.5,340.4,341.3,342.6,344.4,...,1005.5,1007.5,1007.8,1009.6,1013.3,1018.7,1024.4,1030.7,1033.8,1030.6
1,235.7,236.9,236.7,235.4,233.3,230.6,227.3,223.4,219.6,215.8,...,308.0,310.0,312.5,314.1,315.0,316.6,318.1,319.6,321.1,321.8
2,210.4,212.2,212.2,210.7,208.3,205.5,202.5,199.8,198.3,197.3,...,321.0,320.6,320.2,320.4,320.8,321.2,321.2,323.0,326.9,329.9
3,498.1,500.9,503.1,504.6,505.5,505.7,505.3,504.2,503.6,503.4,...,1289.8,1287.7,1287.4,1291.5,1296.6,1299.0,1302.7,1306.4,1308.5,1307.0
4,77.3,77.3,77.3,77.3,77.4,77.5,77.6,77.7,77.7,77.8,...,119.1,119.4,120.0,120.3,120.3,120.3,120.3,120.5,121.0,121.5


In [103]:
data_df.describe().round(2)

Unnamed: 0,1996-04-01,1996-05-01,1996-06-01,1996-07-01,1996-08-01,1996-09-01,1996-10-01,1996-11-01,1996-12-01,1997-01-01,...,2017-07-01,2017-08-01,2017-09-01,2017-10-01,2017-11-01,2017-12-01,2018-01-01,2018-02-01,2018-03-01,2018-04-01
count,13684.0,13684.0,13684.0,13684.0,13684.0,13684.0,13684.0,13684.0,13684.0,13684.0,...,14723.0,14723.0,14723.0,14723.0,14723.0,14723.0,14723.0,14723.0,14723.0,14723.0
mean,118.3,118.42,118.54,118.65,118.78,118.93,119.12,119.35,119.69,120.12,...,273.34,274.87,276.46,278.03,279.52,281.1,282.66,284.37,286.51,288.04
std,86.0,86.16,86.31,86.47,86.65,86.87,87.15,87.48,87.91,88.41,...,360.4,361.47,362.76,364.46,365.6,367.05,369.57,371.77,372.46,372.05
min,11.3,11.5,11.6,11.8,11.8,12.0,12.1,12.2,12.3,12.5,...,14.4,14.5,14.7,14.8,14.5,14.3,14.1,13.9,13.8,13.8
25%,68.8,68.9,69.1,69.2,69.38,69.5,69.6,69.8,70.0,70.3,...,126.9,127.5,128.2,128.7,129.25,129.9,130.6,131.05,131.95,132.4
50%,99.5,99.5,99.7,99.7,99.8,99.9,99.95,100.1,100.3,100.6,...,188.4,189.6,190.5,191.4,192.5,193.4,194.1,195.0,196.7,198.1
75%,143.2,143.3,143.22,143.22,143.5,143.7,143.9,144.12,144.3,144.6,...,305.0,306.65,308.5,309.8,311.7,313.4,315.1,316.85,318.85,321.1
max,3676.7,3704.2,3729.6,3754.6,3781.8,3813.5,3849.6,3888.9,3928.8,3964.6,...,18889.9,18703.5,18605.3,18569.4,18428.8,18307.1,18365.9,18530.4,18337.7,17894.9


## Slicing out Years and attempting to Melt

In [110]:
yr_1996 = data_df.iloc[:,:9]
yr_1996.head()

Unnamed: 0,1996-04-01,1996-05-01,1996-06-01,1996-07-01,1996-08-01,1996-09-01,1996-10-01,1996-11-01,1996-12-01
0,334.2,335.4,336.5,337.6,338.5,339.5,340.4,341.3,342.6
1,235.7,236.9,236.7,235.4,233.3,230.6,227.3,223.4,219.6
2,210.4,212.2,212.2,210.7,208.3,205.5,202.5,199.8,198.3
3,498.1,500.9,503.1,504.6,505.5,505.7,505.3,504.2,503.6
4,77.3,77.3,77.3,77.3,77.4,77.5,77.6,77.7,77.7


In [111]:
melt_data(yr_1996)

KeyError: "The following 'id_vars' are not present in the DataFrame: ['City', 'CountyName', 'Metro', 'RegionID', 'RegionName', 'SizeRank', 'State']"
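
The `KeyError` arises because `melt_data()` passes the seven identifier columns as `id_vars`, and `data_df` no longer contains them. One possible way around this is to slice the year out of the full frame while keeping the identifiers. The sketch below assumes a toy two-row frame in place of the real data; `ID_COLS`, `wide`, and `yr_1996_wide` are illustrative names.

```python
import pandas as pd

ID_COLS = ['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName', 'SizeRank']

# Toy wide-format frame (assumption) with the same layout as zillow_data.csv
wide = pd.DataFrame({
    'RegionID': [84654, 90668],
    'RegionName': [60657, 75070],
    'City': ['Chicago', 'McKinney'],
    'State': ['IL', 'TX'],
    'Metro': ['Chicago', 'Dallas-Fort Worth'],
    'CountyName': ['Cook', 'Collin'],
    'SizeRank': [1, 2],
    '1996-04': [334200.0, 235700.0],
    '1996-05': [335400.0, 236900.0],
    '1997-01': [344400.0, 215800.0],
})

# Keep the identifier columns and only the 1996 date columns
cols_1996 = [c for c in wide.columns if c.startswith('1996')]
yr_1996_wide = wide[ID_COLS + cols_1996]

# Same steps as melt_data(): melt, parse dates, drop nulls, average per month
melted = pd.melt(yr_1996_wide, id_vars=ID_COLS, var_name='time')
melted['time'] = pd.to_datetime(melted['time'], format='%Y-%m')
monthly_mean = melted.dropna(subset=['value']).groupby('time').aggregate({'value': 'mean'})
print(monthly_mean)
```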

# Archive

In [106]:
#data_df.isnull().sum()

In [None]:
#df.columns[7:]

In [None]:
#date_time_cols = pd.to_datetime(df.columns[7:])
#date_time_cols

In [None]:
#cat_cols = df.columns[:7]
#cat_cols

In [None]:
#new_cols = cat_cols + date_time_cols
#new_cols

In [None]:
#df.rename(columns[7:] = date_time_cols)

## Transposing DF

In [None]:
flipped_df = df.transpose()

In [None]:
flipped_df.head(10)

In [None]:
#melt_data(flipped_df)
melt_data(df)

In [None]:
#roi = table9.apply(lambda x: x['total_profit'] / x['production_budget'], axis=1)
# df.loc[[8]] keeps a one-row DataFrame; df.loc[8] returns a Series, which pd.melt cannot take
test = df.loc[[8]]
#test

In [None]:
melt_data(test)

## Exploring SizeRank

In [None]:
df.sort_values('SizeRank').head()

In [None]:
df.sort_values(by = ['State', 'City'], ascending = True)

In [None]:
cat_data = df[['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName', 'SizeRank']]

In [None]:
cat_data

In [None]:
df.head()

In [None]:
date_data = df.drop(['RegionID', 'RegionName', 'City', 'State', 'Metro', 'CountyName', 'SizeRank'], axis=1)

In [None]:
date_data.head()

In [None]:
date_time_data = get_datetimes(date_data)

In [None]:
date_time_data

In [None]:
len(date_time_data)

In [None]:
list(date_data.columns)

In [None]:
#df2 = date_data.append(date_time_data)

## Changing column names to DateTime Format

In [None]:
#df.columns = pd.to_datetime(df.columns)
date_data.columns = pd.to_datetime(date_data.columns)

In [None]:
date_data

### Merging with Categories

In [None]:
#imdb_df = pd.merge(table7, table4, on= 'tconst', how='inner')
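# the two frames share no column names, so align them on the row index instead
base_df = pd.merge(cat_data, date_data, left_index=True, right_index=True, how='outer')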

# Mod 4 Project - Starter Notebook

This notebook has been provided to you so that you can make use of the following starter code to help with the trickier parts of preprocessing the Zillow dataset. 

The notebook contains a rough outline of the general order you'll likely want to follow in this project. You'll notice that most of the areas are left blank. This is so that it's more obvious exactly when you should make use of the starter code provided for preprocessing. 

**_NOTE:_** The number of empty cells is not meant to imply how much or how little code should be involved in any given step--we've just provided a few for your convenience. Add, delete, and change things around in this notebook as needed!

## Some Notes Before Starting

This project will be one of the more challenging projects you complete in this program. This is because working with Time Series data is a bit different than working with regular datasets. In order to make this a bit less frustrating and help you understand what you need to do (and when you need to do it), we'll quickly review the dataset formats that you'll encounter in this project. 

### Wide Format vs Long Format

If you take a look at the format of the data in `zillow_data.csv`, you'll notice that the actual Time Series values are stored as separate columns. Here's a sample: 

<img src='https://raw.githubusercontent.com/learn-co-students/dsc-mod-4-project-seattle-ds-102819/master/images/df_head.png'>

You'll notice that the first seven columns look like any other dataset you're used to working with. However, column 8 refers to the median housing sales values for April 1996, column 9 for May 1996, and so on. This is called **_Wide Format_**, and it makes the dataframe intuitive and easy to read. However, there are problems with this format when it comes to actually learning from the data, because the data only makes sense if you know the name of the column that the data can be found in. Since column names are metadata, our algorithms will miss out on what dates each value is for. This means that before we pass this data to our ARIMA model, we'll need to reshape our dataset to **_Long Format_**. Reshaped into long format, the dataframe above would now look like:

<img src='https://raw.githubusercontent.com/learn-co-students/dsc-mod-4-project-seattle-ds-102819/master/images/melted1.png'>

There are now many more rows in this dataset--one for each unique time and zipcode combination in the data! Once our dataset is in this format, we'll be able to train an ARIMA model on it. The method used to convert from Wide to Long is `pd.melt()`, and it is common to refer to our dataset as 'melted' after the transition to denote that it is in long format. 
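
As a small illustration of the wide-to-long reshape described above, the sketch below melts a toy two-zipcode frame (an assumption, not the real data) so that each row holds one zipcode and month combination. The names `wide` and `long_df` are illustrative.

```python
import pandas as pd

# Toy wide-format frame (assumption): one row per zipcode, one column per month
wide = pd.DataFrame({
    "RegionName": ["60657", "75070"],
    "1996-04": [334200.0, 235700.0],
    "1996-05": [335400.0, 236900.0],
})

# Melt: the month labels move from column names into a 'time' column
long_df = wide.melt(id_vars="RegionName", var_name="time", value_name="value")
long_df["time"] = pd.to_datetime(long_df["time"], format="%Y-%m")
long_df = long_df.sort_values(["RegionName", "time"]).reset_index(drop=True)
print(long_df)
```

Two zipcodes and two months give four rows, one per combination, and the dates are now ordinary data that a model can read.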

## Helper Functions Provided

Melting a dataset can be tricky if you've never done it before, so you'll see that we have provided a sample function, `melt_data()`, to help you with this step below. Also provided is:

* `get_datetimes()`, a function that converts the date column names into pandas datetime objects
* Some good parameters for matplotlib to help make your visualizations more readable. 

Good luck!


## Step 1: Load the Data/Filtering for Chosen Zipcodes

## Step 2: Data Preprocessing

In [None]:
def get_datetimes(df):
    """
    Takes a dataframe:
    returns only those column names that can be converted into datetime objects 
    as datetime objects.
    NOTE number of returned columns may not match total number of columns in passed dataframe
    """
    
    return pd.to_datetime(df.columns.values[1:], format='%Y-%m')

## Step 3: EDA and Visualization

In [None]:
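import matplotlib

# 'sans-serif' replaces the original 'normal', which is not a font family name matplotlib recognizes
font = {'family' : 'sans-serif',
        'weight' : 'bold',
        'size'   : 22}

matplotlib.rc('font', **font)

# NOTE: if your visualizations are too cluttered to read, try calling 'plt.gcf().autofmt_xdate()'!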

## Step 4: Reshape from Wide to Long Format

In [None]:
def melt_data(df):
    """
    Takes the zillow_data dataset in wide form or a subset of the zillow_dataset.  
    Returns a long-form datetime dataframe 
    with the datetime column names as the index and the values as the 'value' column.
    
    If more than one row is passed in the wide-form dataset, the 'value' column
    will be the mean of the values from the datetime columns in all of the rows.
    """
    
    melted = pd.melt(df, id_vars=['RegionName', 'RegionID', 'SizeRank', 'City', 'State', 'Metro', 'CountyName'], var_name='time')
    # infer_datetime_format is deprecated in pandas 2.x; the default parser handles 'YYYY-MM' labels
    melted['time'] = pd.to_datetime(melted['time'])
    melted = melted.dropna(subset=['value'])
    return melted.groupby('time').aggregate({'value':'mean'})

## Step 5: ARIMA Modeling

## Step 6: Interpreting Results