# Data Management and Visualization, assignment 3: Data Management Decisions

## Dataset
I am using the bike sharing dataset from https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset, which records how many bikes are rented out during each hour of each day, along with the weather at that time, whether the day was a working day, etc. I would like to be able to predict the number of bikes rented out, given some information on a specific day and/or time (e.g. the weather conditions, which day/time it is, whether it is a working day or a holiday, etc.).

## Loading data

In [25]:
import pandas
data = pandas.read_csv('hour.csv', low_memory=False)
print("Number of observations: ",len(data))
data.head()

Number of observations:  17379


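|   | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |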


## Dealing with missing data
There is no missing data, so I don't need to do anything special here.
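
A claim like this can be checked by counting null values per column. The following is a minimal sketch on a toy frame with made-up values (not the real `hour.csv`), with one gap inserted on purpose so that the check has something to find:

```python
import pandas

# Toy stand-in for hour.csv (hypothetical values), with one gap added on purpose
toy = pandas.DataFrame({
    'hr': [0, 1, 2, 3],
    'temp': [0.24, 0.22, None, 0.24],
    'cnt': [16, 40, 32, 13],
})

# Number of missing values per column; all zeros means nothing is missing
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(toy.isnull().values.any())
print("Any missing values:", has_missing)
```

On the real frame, `data.isnull().sum()` should report zero for every column if there is indeed no missing data.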

## Creating a secondary variable for the month number
To see whether the number of bikes rented out changes over time (that is, whether the bike rental company is renting out more bikes in general), I create a secondary variable month_no, which numbers the months consecutively from the first month in the dataset: 1 for January 2011 up to 24 for December 2012.

In [26]:
data['month_no'] = data['yr']*12+data['mnth']
pandas.crosstab(data.month_no, data.mnth)

| month_no \ mnth | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 688 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 649 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 730 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 719 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 744 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 720 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 744 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 731 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 717 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 743 | 0 | 0 |
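
To make the `yr*12+mnth` formula concrete, here is a small worked sketch on hypothetical toy rows (not the real data). It relies on `yr` being 0 for the first year and 1 for the second, as in the rows shown when loading the data:

```python
import pandas

# Toy rows (hypothetical): yr is 0 for the first year and 1 for the second
toy = pandas.DataFrame({'yr': [0, 0, 1, 1], 'mnth': [1, 12, 1, 12]})
toy['month_no'] = toy['yr'] * 12 + toy['mnth']
print(toy)

# The first month maps to 1, the last month of the second year to 24
month_numbers = toy['month_no'].tolist()
```

Because `yr` starts at 0, the first year keeps its month numbers 1 to 12 and the second year continues with 13 to 24.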


## Recoding season
In the codebook, season is described as (1:spring, 2:summer, 3:fall, 4:winter). This does not match the actual data, where season 1 covers December through March (winter rather than spring):

In [27]:
pandas.crosstab(data.season, data.mnth)

| season \ mnth | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1429 | 1341 | 949 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 523 |
| 2 | 0 | 0 | 524 | 1437 | 1488 | 960 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 480 | 1488 | 1475 | 1053 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 384 | 1451 | 1437 | 960 |


I will recode season with descriptive labels that match the data. The numeric prefix in each label keeps the categories in calendar order when they are sorted.

In [28]:
seasmap = {1: '1. Winter', 2:'2. Spring', 3:'3. Summer', 4:'4. Fall'}
data['season'] = data['season'].map(seasmap)
pandas.crosstab(data.season, data.mnth)

| season \ mnth | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. Winter | 1429 | 1341 | 949 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 523 |
| 2. Spring | 0 | 0 | 524 | 1437 | 1488 | 960 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3. Summer | 0 | 0 | 0 | 0 | 0 | 480 | 1488 | 1475 | 1053 | 0 | 0 | 0 |
| 4. Fall | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 384 | 1451 | 1437 | 960 |
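
One thing to watch when recoding with a dictionary: `Series.map` returns NaN for any value that is not a key. The sketch below uses hypothetical toy codes, including a deliberately invalid code 5, to show how unmapped values could be detected:

```python
import pandas

seasmap = {1: '1. Winter', 2: '2. Spring', 3: '3. Summer', 4: '4. Fall'}

# Toy season codes (hypothetical); 5 is deliberately not in the mapping
codes = pandas.Series([1, 2, 3, 4, 5])
recoded = codes.map(seasmap)

# Series.map returns NaN for any value that has no key in the dictionary
unmapped = codes[recoded.isnull()].tolist()
print(recoded)
print("Codes without a label:", unmapped)
```

For the same reason, running the recoding cell a second time on the already recoded column would turn every value into NaN, since the labels themselves are not keys of `seasmap`.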


## Selecting data and frequency tables of selected data

In [41]:
sub = data[['season','month_no', 'hr','workingday', 'weathersit','temp','hum', 'windspeed','cnt']]

### Frequency table of month_no

In [37]:
print(sub['month_no'].value_counts(sort=False).sort_index())
print(sub['month_no'].value_counts(sort=False,normalize=True).sort_index())

1     688
2     649
3     730
4     719
5     744
6     720
7     744
8     731
9     717
10    743
11    719
12    741
13    741
14    692
15    743
16    718
17    744
18    720
19    744
20    744
21    720
22    708
23    718
24    742
Name: month_no, dtype: int64
1     0.039588
2     0.037344
3     0.042005
4     0.041372
5     0.042810
6     0.041429
7     0.042810
8     0.042062
9     0.041257
10    0.042753
11    0.041372
12    0.042638
13    0.042638
14    0.039818
15    0.042753
16    0.041314
17    0.042810
18    0.041429
19    0.042810
20    0.042810
21    0.041429
22    0.040739
23    0.041314
24    0.042695
Name: month_no, dtype: float64
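
Since every frequency table in this section prints counts and proportions separately, the two could also be shown side by side. This is a minimal sketch with a hypothetical helper `freq_table`, run on toy values rather than the real `month_no` column:

```python
import pandas

def freq_table(series):
    """Return counts and proportions of each value, ordered by value."""
    counts = series.value_counts(sort=False).sort_index()
    return pandas.DataFrame({
        'count': counts,
        'proportion': counts / counts.sum(),
    })

# Toy data (hypothetical values), not the real month_no column
toy = pandas.Series([1, 1, 2, 3, 3, 3, 3, 2], name='month_no')
table = freq_table(toy)
print(table)
```

On the real data, the call would be `freq_table(sub['month_no'])`. Dividing the counts by their sum gives the same proportions as `normalize=True`.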


### Frequency table of season

In [38]:
print(sub['season'].value_counts(sort=False).sort_index())
print(sub['season'].value_counts(sort=False,normalize=True).sort_index())

1. Winter    4242
2. Spring    4409
3. Summer    4496
4. Fall      4232
Name: season, dtype: int64
1. Winter    0.244088
2. Spring    0.253697
3. Summer    0.258703
4. Fall      0.243512
Name: season, dtype: float64


### Frequency tables of other categorical variables of interest

In [39]:
print(sub['weathersit'].value_counts(sort=False))
print(sub['weathersit'].value_counts(sort=False, normalize=True))

1    11413
2     4544
3     1419
4        3
Name: weathersit, dtype: int64
1    0.656712
2    0.261465
3    0.081650
4    0.000173
Name: weathersit, dtype: float64


In [42]:
print(sub['workingday'].value_counts(sort=False))
print(sub['workingday'].value_counts(sort=False,normalize=True))

0     5514
1    11865
Name: workingday, dtype: int64
0    0.317279
1    0.682721
Name: workingday, dtype: float64


In [44]:
print(sub['hr'].value_counts(sort=False).sort_index())
print(sub['hr'].value_counts(sort=False,normalize=True).sort_index())

0     726
1     724
2     715
3     697
4     697
5     717
6     725
7     727
8     727
9     727
10    727
11    727
12    728
13    729
14    729
15    729
16    730
17    730
18    728
19    728
20    728
21    728
22    728
23    728
Name: hr, dtype: int64
0     0.041775
1     0.041659
2     0.041142
3     0.040106
4     0.040106
5     0.041257
6     0.041717
7     0.041832
8     0.041832
9     0.041832
10    0.041832
11    0.041832
12    0.041890
13    0.041947
14    0.041947
15    0.041947
16    0.042005
17    0.042005
18    0.041890
19    0.041890
20    0.041890
21    0.041890
22    0.041890
23    0.041890
Name: hr, dtype: float64


## Summary
There is no missing data in my dataset, and most of the data is already in a clean format, so not much management was needed. I added a variable for the month number since the first observation and recoded the season variable to be more descriptive.

Seasons, month numbers, and hours are distributed fairly uniformly, indicating that no (or very few) observations are omitted from the dataset.
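
As a rough cross-check of how many hourly records might be absent, the observed row count can be compared with the number of hours in the covered period. This sketch assumes the data is meant to cover every hour from 2011-01-01 through 2012-12-31. That assumption is based on the first date shown when loading the data and on the 24 month numbers, and is not stated in this notebook:

```python
from datetime import datetime

# Assumption: the data covers every hour from 2011-01-01 00:00 through 2012-12-31 23:00
start = datetime(2011, 1, 1)
end = datetime(2013, 1, 1)  # exclusive upper bound

expected_hours = (end - start).days * 24
observed_hours = 17379  # number of observations reported when loading the data
missing_hours = expected_hours - observed_hours
missing_share = missing_hours / expected_hours

print(expected_hours, missing_hours, round(100 * missing_share, 2))
```

Under that assumption, 165 of the 17544 possible hours (a little under 1%) have no record.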

68% of the observations are on working days. Very few (3 out of 17379 observations = 0.017%) were recorded in very bad weather (encoded as 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog).
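
The percentages can be recomputed directly from the counts printed in the weathersit and workingday frequency tables. A minimal sketch, with the counts copied by hand from those tables:

```python
# Counts copied from the weathersit and workingday frequency tables
weathersit_counts = {1: 11413, 2: 4544, 3: 1419, 4: 3}
workingday_counts = {0: 5514, 1: 11865}

total = sum(weathersit_counts.values())
worst_weather_pct = 100 * weathersit_counts[4] / total
working_day_pct = 100 * workingday_counts[1] / total

print(total, round(worst_weather_pct, 3), round(working_day_pct, 1))
```

The total matches the 17379 observations reported when loading the data. Category 4 holds 3 observations (about 0.017%), and working days account for about 68.3%.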