# Capstone Project

## Project Definition

In this project we investigate the American Time Use Survey (ATUS) and use Singular Value Decomposition (SVD) to see whether we can predict what type of occupation someone has based on how they spend their day. We also take a quick look at whether the data supports Stanley Robert Parker's thesis in **The Future of Work and Leisure** that people who have more say in decision-making at their jobs (high autonomy) participate in more active leisure activities than those who have low autonomy in their jobs.

We will answer three questions:

- Can we predict the occupation type for high or low autonomy occupations based on the categorized activities?
- Can we predict the occupation type for all occupations based on the categorized activities?
- Can we predict the occupation type for all occupations based on the individual activities?

To judge the accuracy of our predictions, we will look at the differences between the actual occupation codes and the values predicted for those occupations.
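
As a minimal sketch of this accuracy measure, the snippet below computes the mean absolute difference between actual and predicted occupation codes. The arrays `actual` and `predicted` hold made-up illustrative values, not ATUS results.

```python
import numpy as np

# Hypothetical occupation codes and model outputs (illustrative values only)
actual = np.array([1, 2, 2, 1])
predicted = np.array([1.5, 1.0, 2.5, 1.0])

# Signed differences show the direction of each error
diffs = actual - predicted

# The mean absolute difference summarizes how far off the predictions are on average
mean_abs_diff = np.abs(diffs).mean()
print(round(mean_abs_diff, 1))
```

A smaller mean absolute difference means the predicted values sit closer to the actual occupation codes.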

## Analysis

The structure of this section was inspired by [Paula Hwang](https://github.com/hwangmpaula/data-wrangling/blob/master/wrangle_act.ipynb)

### Gather

Using the ATUS data from [Kaggle](https://www.kaggle.com/bls/american-time-use-survey) we will analyse whether we can predict occupation type from daily activities. 

The data files we need are the **Activity summary** file and the **Respondent** file.

**Activity summary** from Kaggle:

> The Activity summary file contains information about the total time each ATUS respondent spent doing each activity on the diary day.

**Respondent** from Kaggle:

> The Respondent file contains information about ATUS respondents, including their labor force status and earnings.

These are already packaged in neat .csvs for data science use.

In [1]:
#import libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import requests

# Load datasets
ATUS_sum = pd.read_csv('../data/atussum.csv') #The activity summary datafile
ATUS_resp = pd.read_csv('../data/atusresp.csv') #The respondent datafile

### Assess

**1. Quality**
(1 issue)
  - Remove rows without occupation codes

**2. Tidiness**
(3 issues)
   - Remove unnecessary columns from Respondent
   - Remove unnecessary columns from Activity Summary
   - Unify into one dataframe
   
**3. Preparation**
(3 issues)
   - Create life space categories
   - Create categorized dataframe
   - Create high and low only categorized dataframe

In [2]:
# A quick visual of the activity summary dataframe
ATUS_sum.head(10)

Unnamed: 0,tucaseid,gemetsta,gtmetsta,peeduca,pehspnon,ptdtrace,teage,telfs,temjot,teschenr,...,t181801,t181899,t189999,t500101,t500103,t500104,t500105,t500106,t500107,t509989
0,20030100013280,1,-1,44,2,2,60,2,2,-1,...,0,0,0,0,0,0,0,0,0,0
1,20030100013344,2,-1,40,2,1,41,1,2,2,...,0,0,0,0,0,0,0,0,0,0
2,20030100013352,1,-1,41,2,1,26,2,2,2,...,0,0,0,0,0,0,0,0,0,0
3,20030100013848,2,-1,39,2,2,36,4,-1,2,...,0,0,0,0,0,0,0,0,0,0
4,20030100014165,2,-1,45,2,1,51,1,2,-1,...,0,0,0,0,0,0,0,0,0,0
5,20030100014169,2,-1,43,2,1,32,2,2,1,...,0,0,0,0,0,0,0,0,0,0
6,20030100014209,1,-1,39,2,1,44,1,2,2,...,0,0,0,0,0,0,0,0,0,0
7,20030100014427,1,-1,40,2,1,21,1,2,2,...,0,0,0,0,0,0,0,0,0,0
8,20030100014550,2,-1,41,2,1,33,1,2,2,...,0,0,0,0,0,0,0,0,0,0
9,20030100014758,1,-1,41,2,2,39,1,2,2,...,0,0,0,0,0,0,0,0,0,0


In [3]:
# Get the shape of the activity summary data
ATUS_sum.shape

(170842, 455)

In [4]:
# Check the case ids for uniqueness
ATUS_sum.tucaseid.nunique()

170842

In [5]:
# Looks good. Now to figure out which is the first column with activity information
ATUS_sum_columns = ATUS_sum.columns.to_list()
ATUS_sum_columns[:30]

['tucaseid',
 'gemetsta',
 'gtmetsta',
 'peeduca',
 'pehspnon',
 'ptdtrace',
 'teage',
 'telfs',
 'temjot',
 'teschenr',
 'teschlvl',
 'tesex',
 'tespempnot',
 'trchildnum',
 'trdpftpt',
 'trernwa',
 'trholiday',
 'trspftpt',
 'trsppres',
 'tryhhchild',
 'tudiaryday',
 'tufnwgtp',
 'tehruslt',
 'tuyear',
 't010101',
 't010102',
 't010199',
 't010201',
 't010299',
 't010301']

In [6]:
# 't010101' is the first activity column (which matches the data dictionary)
# Setting its location in a variable for later
act_start = ATUS_sum_columns.index('t010101')

In [7]:
# Checking for any NaN values in the data I'll be using
ATUS_sum[ATUS_sum_columns[act_start:]].isnull().sum().sort_values(ascending=False)

t509989    0
t060101    0
t060103    0
t060104    0
t060199    0
          ..
t130136    0
t130199    0
t130201    0
t130202    0
t010101    0
Length: 431, dtype: int64

In [8]:
# No NaNs, good sign.
# Let's take a look at the most common values for the sleep category (t010101)
ATUS_sum['t010101'].value_counts()

480     10495
540     10296
510      8708
600      7467
450      7225
        ...  
1088        1
961         1
1089        1
1345        1
1023        1
Name: t010101, Length: 1120, dtype: int64

In [9]:
# A quick visual of the respondent dataframe
ATUS_resp.head(10)

Unnamed: 0,tucaseid,tulineno,tespuhrs,trdtind1,trdtocc1,trernhly,trernupd,trhernal,trhhchild,trimind1,...,tryhhchild,trwbmodr,trtalone_wk,trtccc_wk,trlvmodr,trtec,tuecytd,tuelder,tuelfreq,tuelnum
0,20030100013280,1,-1.0,40,8,2200.0,1,1,2,15,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
1,20030100013344,1,50.0,16,16,-1.0,1,-1,1,5,...,0,-1,-1,-1,-1,-1,-1,-1,-1,-1
2,20030100013352,1,-1.0,43,15,1250.0,0,0,2,16,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
3,20030100013848,1,40.0,-1,-1,-1.0,-1,-1,1,-1,...,9,-1,-1,-1,-1,-1,-1,-1,-1,-1
4,20030100014165,1,-1.0,42,10,-1.0,-1,-1,1,16,...,14,-1,-1,-1,-1,-1,-1,-1,-1,-1
5,20030100014169,1,40.0,40,8,-1.0,1,-1,1,15,...,2,-1,-1,-1,-1,-1,-1,-1,-1,-1
6,20030100014209,1,50.0,43,15,-1.0,-1,-1,1,16,...,9,-1,-1,-1,-1,-1,-1,-1,-1,-1
7,20030100014427,1,-1.0,41,11,950.0,0,0,1,16,...,14,-1,-1,-1,-1,-1,-1,-1,-1,-1
8,20030100014550,1,40.0,34,17,1400.0,0,0,1,12,...,3,-1,-1,-1,-1,-1,-1,-1,-1,-1
9,20030100014758,1,-1.0,41,11,1200.0,0,0,1,16,...,4,-1,-1,-1,-1,-1,-1,-1,-1,-1


In [10]:
#get the shape of the respondent data
ATUS_resp.shape

(170842, 132)

In [11]:
#check the case ids for uniqueness
ATUS_resp.tucaseid.nunique()

170842

In [12]:
#check how many rows are without an occupation code (trdtocc1), which is listed in the data as -1
ATUS_resp['trdtocc1'].value_counts()

-1     64220
 17    14068
 1     12867
 16    10649
 8      7735
 10     6174
 21     6078
 22     5504
 2      5288
 19     4775
 13     4337
 14     3966
 15     3796
 20     3399
 3      3115
 11     2463
 4      2336
 12     2246
 9      2238
 6      2124
 7      1462
 5      1226
 18      776
Name: trdtocc1, dtype: int64

Unfortunately, a large share of the rows (64,220 of 170,842, about 38%) have no occupation code.
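
The share of missing codes can be read off the counts above. A small sketch, using a toy Series in place of `ATUS_resp['trdtocc1']` (the toy values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for ATUS_resp['trdtocc1']; -1 marks a missing occupation code
occ = pd.Series([-1, 17, 1, -1, 16, 8, -1, 1])

# Fraction of rows whose occupation code is the -1 sentinel
missing_share = (occ == -1).mean()
print(f'{missing_share:.1%} of rows have no occupation code')

# The same arithmetic on the counts reported above
reported_share = 64220 / 170842
print(round(reported_share, 3))
```

Because ATUS marks missing values with `-1` rather than `NaN`, a check such as `isnull()` would not flag these rows.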

Lastly, in order to answer the questions, we need to categorize the activities into Parker's six 'life space' categories. More on this in section **3.1** of the **Clean** section below.

### Clean

Method for each issue in Assess

 - Define the problem
 - Code the solution
 - Test the solution

**1.1 Remove rows without occupation codes - Define**

Rows without occupation codes are not useful in this investigation. All rows with `-1` in the `trdtocc1` column need to be removed.

**1.1 Remove rows without occupation codes - Code**

In [13]:
ATUS_resp = ATUS_resp.query('trdtocc1 != -1')

**1.1 Remove rows without occupation codes - Test**

In [14]:
ATUS_resp.trdtocc1.value_counts()

17    14068
1     12867
16    10649
8      7735
10     6174
21     6078
22     5504
2      5288
19     4775
13     4337
14     3966
15     3796
20     3399
3      3115
11     2463
4      2336
12     2246
9      2238
6      2124
7      1462
5      1226
18      776
Name: trdtocc1, dtype: int64

**2.1 Remove unnecessary columns in Respondent - Define**

The only column we need from **Respondent** is `trdtocc1`, plus `tucaseid` as the key for joining. All others can be removed.

**2.1 Remove unnecessary columns in Respondent - Code**

In [15]:
ATUS_resp = ATUS_resp[['tucaseid','trdtocc1']]

**2.1 Remove unnecessary columns in Respondent - Test**

In [16]:
ATUS_resp.head(10)

Unnamed: 0,tucaseid,trdtocc1
0,20030100013280,8
1,20030100013344,16
2,20030100013352,15
4,20030100014165,10
5,20030100014169,8
6,20030100014209,15
7,20030100014427,11
8,20030100014550,17
9,20030100014758,11
10,20030100014928,16


**2.2 Remove unnecessary columns in Activity summary - Define**

For **Activity summary** we do not need any columns that contain data on the respondent; we only need the time they spent on the individual activities. We also do not need the last seven columns because they measure entry errors.

**2.2 Remove unnecessary columns in Activity summary - Code**

In [17]:
#remove non-activity columns
ATUS_sum = ATUS_sum.drop(columns=ATUS_sum_columns[1:act_start])
#remove the entry error activity columns (t5#####)
ATUS_sum = ATUS_sum.drop(columns=ATUS_sum_columns[-7:])

**2.2 Remove unnecessary columns in Activity summary - Test**

In [18]:
ATUS_sum.head()

Unnamed: 0,tucaseid,t010101,t010102,t010199,t010201,t010299,t010301,t010399,t010401,t010499,...,t181399,t181401,t181499,t181501,t181599,t181601,t181699,t181801,t181899,t189999
0,20030100013280,870,0,0,40,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,20030100013344,620,0,0,60,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,20030100013352,560,0,0,80,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,20030100013848,720,0,0,35,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,20030100014165,385,0,0,75,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**2.3 Make one new database - Define**

We need to join the **Activity summary** and **Respondent** dataframes in order to complete our full analysis.

**2.3 Make one new database - Code**

In [19]:
ATUS_data = ATUS_resp.set_index('tucaseid').join(ATUS_sum.set_index('tucaseid'))
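
`DataFrame.join` performs a left join on the index by default, so only the case ids still present in `ATUS_resp` are kept. A toy sketch with made-up ids and values:

```python
import pandas as pd

# Toy stand-ins for the two cleaned frames (ids and values are made up)
resp = pd.DataFrame({'tucaseid': [1, 2, 4], 'trdtocc1': [8, 16, 10]})
summ = pd.DataFrame({'tucaseid': [1, 2, 3, 4], 't010101': [870, 620, 720, 385]})

# Left join on the shared key: case 3 has no occupation row, so it is dropped
joined = resp.set_index('tucaseid').join(summ.set_index('tucaseid'))
print(joined)
```

This is why the joined frame should have as many rows as the filtered respondent frame rather than the full activity summary.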

**2.3 Make one new database - Test**

In [20]:
ATUS_data.head()

tucaseid,trdtocc1,t010101,t010102,t010199,t010201,t010299,t010301,t010399,t010401,t010499,...,t181399,t181401,t181499,t181501,t181599,t181601,t181699,t181801,t181899,t189999
20030100013280,8,870,0,0,40,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20030100013344,16,620,0,0,60,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20030100013352,15,560,0,0,80,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20030100014165,10,385,0,0,75,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20030100014169,8,675,0,0,35,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
#check for correct shape
ATUS_data.shape

(106622, 425)

**3.1 Create life space categories - Define**

Stanley Robert Parker's **The Future of Work and Leisure** (1971) lays out six components of 'life space': **Work**, **Work obligations**, **Physiological needs**, **Non-work obligations**, **Leisure**, **Leisure-in-work**.

According to Parker:
>'Life space' means the total of activities or ways of spending time that people have.
	
I've classified the activities included in the American Time Use Survey into these categories; the breakdown is listed below.

**Work**
_working time, sold time, subsistence time_

Time spent in employment, 'earning a living' for the worker and their dependants. 


**Work obligations**
_Work-related time_

Activities around work, but not during employment. Examples include commuting to work or grooming for work.



**Physiological needs**
_Existence time_

The necessities of life: eating, sleeping, eliminating, etc.


**Non-work obligations**
_semi-work_

Activities that were once leisure but have taken on the character of obligations. These are often obligations toward other people.

**Leisure**
_free time, spare time, uncommitted time, discretionary time, choosing time_

With two subcategories, **Active leisure** and **Passive leisure**.

**Active leisure** is more engaged and creative.


**Passive leisure** is more consumptive and less engaged.


For simplicity's sake, **Leisure-in-work** will be classified with **Work obligations**.

**3.1 Create life space categories - Code**

In [22]:
#get the list of columns
data_columns = ATUS_data.columns.tolist()
#remove occupation (trdtocc1) column
data_columns.pop(0)

#create new lists
work = []
work_obligations = []
physiological_needs = []
non_work_obligations = []
active_leisure = []
passive_leisure = []

#every activity column is 't' followed by six digits, so a prefix test identifies its category
for col in data_columns:
    #classify into work
    if col == 't050101':
        work.append(col)
    #classify into work obligations (the rest of t05, plus travel related to work)
    elif col.startswith('t05') or col == 't180501':
        work_obligations.append(col)
    #classify into physiological needs
    elif col.startswith(('t01', 't11')):
        physiological_needs.append(col)
    #classify non-work obligations
    elif col.startswith(('t02', 't03', 't04', 't06', 't08', 't09', 't10', 't14', 't15', 't16', 't18')):
        non_work_obligations.append(col)
    #classify active leisure
    elif col.startswith(('t1201', 't1202', 't12031', 't12039', 't1204', 't1205', 't1299', 't1301', 't1399')) or col == 't120309':
        active_leisure.append(col)
    #classify passive leisure (t120309 was already caught as active leisure above)
    elif col.startswith(('t07', 't1302', 't1303', 't1304', 't12030')):
        passive_leisure.append(col)


**3.1 Create life space categories - Test**

In [23]:
#check to see if the categories are equal to the original column numbers
assert len(data_columns) == len(work) + len(work_obligations) + len(physiological_needs) + len(non_work_obligations) + len(active_leisure) + len(passive_leisure), 'Lists are unequal'
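
The length check confirms that the counts match, but it does not say which column is at fault when it fails. A possible complementary check, sketched on toy lists (the column names and the three toy categories below are illustrative assumptions):

```python
# Toy column list and category lists (illustrative only)
columns = ['t010101', 't050101', 't050102', 't120301']
categories = {
    'work': ['t050101'],
    'work_obligations': ['t050102'],
    'physiological_needs': ['t010101'],
}

# Flatten the category lists and compare them with the full column list
assigned = [col for cols in categories.values() for col in cols]
unassigned = sorted(set(columns) - set(assigned))
duplicated = sorted({col for col in assigned if assigned.count(col) > 1})

print('Unassigned:', unassigned)
print('Assigned more than once:', duplicated)
```

Listing the unassigned and duplicated names makes it easier to see which classification rule needs adjusting.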

**3.2 Create categorized dataframe - Define**

Now that we have the categories set up, we can flatten the columns in those categories and create a new dataframe.

**3.2 Create categorized dataframe - Code**

In [24]:
#Create new dataframe
ATUS_category_data = ATUS_data[['trdtocc1','t010101']]

In [25]:
#create lists of the categories
life_space_categories = [work, work_obligations, physiological_needs, non_work_obligations, active_leisure, passive_leisure]
life_space_titles = ['Work', 'Work Obligations', 'Physiological Needs', 'Non-work Obligations', 'Active Leisure', "Passive Leisure"]

#loop through the dataframe to create new columns
for i in range(len(life_space_categories)):
    ATUS_category_data.insert(loc=ATUS_category_data.shape[1],column=life_space_titles[i], value=ATUS_data[life_space_categories[i]].sum(axis=1))

In [26]:
#drop the t010101 column that was there as a placeholder
ATUS_category_data = ATUS_category_data.drop(columns='t010101')
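
The placeholder column could be avoided by building the category totals first and attaching the occupation column afterwards. A sketch on a toy frame (the values and the two toy categories are made up for illustration):

```python
import pandas as pd

# Toy stand-in for ATUS_data
data = pd.DataFrame({
    'trdtocc1': [8, 16],
    't010101': [870, 620],
    't010201': [40, 60],
    't050101': [0, 480],
})
categories = {
    'Physiological Needs': ['t010101', 't010201'],
    'Work': ['t050101'],
}

# Sum the activity columns of each category row-wise
totals = pd.DataFrame({title: data[cols].sum(axis=1) for title, cols in categories.items()})

# Put the occupation code in front of the category totals
category_data = pd.concat([data[['trdtocc1']], totals], axis=1)
print(category_data)
```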

**3.2 Create categorized dataframe - Test**

In [27]:
#take a look
ATUS_category_data.head()

tucaseid,trdtocc1,Work,Work Obligations,Physiological Needs,Non-work Obligations,Active Leisure,Passive Leisure
20030100013280,8,0,0,915,0,200,325
20030100013344,16,0,0,770,80,530,60
20030100013352,15,0,0,715,325,220,180
20030100014165,10,500,15,552,253,60,60
20030100014169,8,0,0,775,360,195,110


**3.3 Create high and low only categorized dataframe - Define**

For the simplest analysis, we will see whether the categories predict correctly for the most autonomous category of occupation, Administration (`1`), and a category with a low amount of autonomy, Transportation (`22`). For this we need the earlier `ATUS_category_data` restricted to those two occupations.

**3.3 Create high and low only categorized dataframe - Code**

In [28]:
#keep only the occupations we're interested in (.copy() avoids a SettingWithCopyWarning on the next assignment)
ATUS_category_data_lite = ATUS_category_data.query('trdtocc1 == 1 or trdtocc1 == 22').copy()
#to make things easier, we'll change '22' to 2
ATUS_category_data_lite['trdtocc1'] = ATUS_category_data_lite['trdtocc1'].replace(22, 2)


**3.3 Create high and low only categorized dataframe - Test**

In [29]:
ATUS_category_data_lite['trdtocc1'].value_counts()

1    12867
2     5504
Name: trdtocc1, dtype: int64

In [30]:
ATUS_category_data.groupby('trdtocc1').mean()

trdtocc1,Work,Work Obligations,Physiological Needs,Non-work Obligations,Active Leisure,Passive Leisure
1,270.683221,31.817362,611.18598,249.244657,91.086267,176.358281
2,239.27969,30.694024,620.980144,260.142398,93.788011,184.569213
3,243.123596,30.102408,618.103371,244.128732,92.481862,202.431461
4,250.236729,31.401541,612.805223,255.783818,92.79238,187.434932
5,241.923328,29.896411,620.863785,268.460848,99.54894,168.76509
6,239.54096,33.11064,617.599341,274.206215,85.572976,179.326271
7,237.874829,31.705198,622.075239,258.822845,106.738714,171.419973
8,199.003361,26.023917,627.971946,296.603878,102.07576,175.891273
9,229.446381,35.616175,622.16756,262.750223,100.276586,178.155496
10,250.895206,30.49838,615.300292,280.42517,88.331552,164.201166


### Analyze

In [31]:
# Compare the mean time per category for the high autonomy group (1) and the low autonomy group (2)
ATUS_category_data_lite.groupby('trdtocc1').mean()

trdtocc1,Work,Work Obligations,Physiological Needs,Non-work Obligations,Active Leisure,Passive Leisure
1,270.683221,31.817362,611.18598,249.244657,91.086267,176.358281
2,259.09048,30.229106,623.946584,203.973292,78.455487,235.025073


On average, the high autonomy group spends more time on active leisure than the low autonomy group (about 91 versus 78 minutes per day), while the low autonomy group spends more time on passive leisure (about 235 versus 176 minutes). Both groups spend far more time on passive leisure than on active leisure.

**Regression**

Inspired by [Data Driven Science and Engineering](http://www.databookuw.com/) by Steve L. Brunton and J. Nathan Kutz 


In [32]:
def SVD_Regression(df, column):
    '''
    This function computes a least-squares regression from a singular value decomposition (SVD).
    We approximate x in Ax = b using the pseudo-inverse of A.
    
    inputs:
    df - a dataframe
    column - string, our target column
    
    returns:
    A - the matrix of predictors (every column except the target)
    b - the target column
    x - the least-squares coefficients, so that A @ x approximates b
    '''
    #get our matrix and target column
    b = df[column] # occupations
    A = df.drop(columns=[column]) # activities
    
    #generate the values from the economy-size SVD: A = U @ diag(S) @ VT
    U, S, VT = np.linalg.svd(A, full_matrices=False)
    
    #find x by multiplying the target column by the pseudo-inverse of the matrix
    x = VT.T @ np.linalg.inv(np.diag(S)) @ U.T @ b
    
    return A, b, x
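
As a sanity check of the pseudo-inverse formula `x = V @ inv(diag(S)) @ U.T @ b`, the sketch below applies it to a small made-up system and compares the result with NumPy's built-in least-squares solver. The matrix and target values are illustrative, not ATUS data.

```python
import numpy as np

# Small made-up system with more rows than columns (overdetermined)
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 2.0])

# Economy-size SVD: A = U @ diag(S) @ VT
U, S, VT = np.linalg.svd(A, full_matrices=False)

# Least-squares solution via the pseudo-inverse
x = VT.T @ np.diag(1.0 / S) @ U.T @ b

# Reference solution from NumPy's least-squares routine
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x)
print(np.allclose(x, x_ref))
```

Note that the formula divides by the singular values, so it only applies when none of them is zero, that is, when the matrix has full column rank.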

In [33]:
def SVD_Regression_Results(A, x, b):
    '''
    This function calculates the differences between the actual b column and our generated b column (A @ x).
    It prints the rounded mean of the absolute difference and plots both sets of differences.
    
    inputs:
    A - the matrix
    x - the coefficients returned by SVD_Regression
    b - the target column
    
    returns:
    none
    '''
    #set up lists for our differences
    b_diffs = []
    b_diffs_abs = []
    
    #reset the indexes to make looping easier (on a copy, so the caller's data is untouched)
    b = b.reset_index(drop=True)
    
    #generate our expected b values
    b_x = A @ x
    b_x.reset_index(drop=True, inplace=True)
    
    #loop through the rows to find the differences (and absolute differences) in the values
    for i in range(len(b)):
        b_diffs_abs.append(abs(b[i] - b_x[i]))
        b_diffs.append(b[i] - b_x[i])
        
    # calculate and print the mean (an f-string avoids concatenating a str with a float)
    b_mean = round(sum(b_diffs_abs) / len(b_diffs_abs), 1)
    print(f'The average of the absolute difference between actual b and expected b is: {b_mean}')
    
    #visualize the differences and absolute differences
    SVD_viz(b_diffs, 'Difference between actual b and expected b')
    SVD_viz(b_diffs_abs, 'Absolute difference between actual b and expected b')

In [34]:
def SVD_viz(data, title):
    '''
    This function displays a histogram of the spread of differences.
    
    Inputs:
    data: list, the differences between b and expected b
    title: string, title of the graph
    
    returns:
    none
    '''
    #generate graph
    plt.subplot()
    plt.hist(data, 20)

    #label graph and print
    plt.xlabel('Difference')
    plt.ylabel('Count')
    plt.title(title)
    plt.show()

#### `ATUS_category_data_lite`

In [35]:
A, b, x = SVD_Regression(ATUS_category_data_lite,'trdtocc1')

In [36]:
SVD_Regression_Results(A, x, b)

#### `ATUS_category_data`

In [None]:
A, b, x = SVD_Regression(ATUS_category_data,'trdtocc1')

In [None]:
SVD_Regression_Results(A, x, b)

#### `ATUS_data`

In [None]:
A, b, x = SVD_Regression(ATUS_data,'trdtocc1')

In [None]:
SVD_Regression_Results(A, x, b)

### Model

**Regression Machine Learning**

Inspired by [Data Driven Science and Engineering](http://www.databookuw.com/) by Steve L. Brunton and J. Nathan Kutz

In [None]:
def SVD_Regression_Predict(df, column, n):
    '''
    This function uses the SVD regression to see if we can predict the values of our target column on a separate set of data.
    
    inputs:
    df - a dataframe
    column - string, our target column
    n - int, number of rows in the training set (the remaining rows form the test set)
    
    returns:
    none
    '''

    #shuffle up the dataset
    df_p = df.sample(frac=1)
    
    #split the data into train and test sets
    
    #create train dataset (the first n shuffled rows)
    df_train = df_p.iloc[:n]

    #create test dataset (the remaining rows)
    df_test = df_p.iloc[n:]
    #create the matrix and target column of the test
    A_test = df_test.drop(columns=[column]) 
    b_test = df_test[column]
    
    #get the x back from the SVD regression function
    A_train, b_train, x = SVD_Regression(df_train, column)
    
    #send to the results function (and visualization function)
    SVD_Regression_Results(A_test, x, b_test)
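
The split itself can be sketched on a toy frame. The snippet assumes `n` is the number of training rows, as the slicing in the function implies. The `random_state` argument is added here only to make the illustration repeatable and is not part of the function above.

```python
import pandas as pd

# Toy frame with ten rows (values are made up)
df = pd.DataFrame({'trdtocc1': [1, 2] * 5, 'Work': range(10)})
n = 7  # number of rows used for training

# Shuffle the rows, then split by position
shuffled = df.sample(frac=1, random_state=0)
df_train = shuffled.iloc[:n]
df_test = shuffled.iloc[n:]

print(len(df_train), len(df_test))
```

Splitting by position with `.iloc` keeps the two sets disjoint and uses every shuffled row exactly once.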

#### `ATUS_category_data_lite`

In [None]:
SVD_Regression_Predict(ATUS_category_data_lite,'trdtocc1', 10000)

#### `ATUS_category_data`

In [None]:
SVD_Regression_Predict(ATUS_category_data,'trdtocc1', 100000)

#### `ATUS_data`

In [None]:
SVD_Regression_Predict(ATUS_data,'trdtocc1', 100000)

## Conclusion

As the histograms and mean absolute differences show, daily activities are not a good way to predict occupation. To the credit of the SVD regression, though, the predictions for the held-out test rows were not much further off than the predictions for the rows whose b values were included in the SVD.