# Data Programming in Python | BAIS:6040
# Exam 1

## Instructions

To complete the exam, fill in the commands needed to finish all of the exercises below. Program everything inside this notebook. You may use more than one code cell under the "Your answer here" cell if you wish. If the exercises request that you store information within certain variables, please use those specific variable names (case sensitive).

## Questions

1\. Read all aspects of the question carefully and follow the directions closely. (30 points)

- Create a new pandas dataframe called __randoms__ with 3 columns, each containing 100 observations of random numbers with a mean of 1 and a standard deviation of 2. The column names should be Rand1, Rand2, and Rand3.


- Add a new column called __Date__ of type pandas datetime that holds the first day of each year, starting with 1970 and continuing at a consistent one-year interval for the entire __randoms__ dataframe. So, each observation will be the first date of the year for each year starting in 1970 (Jan 1, 1970; Jan 1, 1971; Jan 1, 1972; etc.).


- Using the __context manager__, write out the randoms dataframe to a table called __randoms__ in an sqlite database called __numbers.db__ that is stored in the Data folder of your current working directory. If the table already exists, overwrite it. Do not write out the dataframe index as a column.

In [1]:
# Imports used 
import numpy as np
import pandas as pd
import datetime
import sqlite3 as sq3
import os  

In [2]:
# Answer for Question 1
randoms = pd.DataFrame(data = np.random.normal(1,2,(100,3))
                   ,columns=['Rand1', 'Rand2','Rand3'])
randoms['Date'] = [datetime.datetime(1970+i, 1, 1) for i in range(len(randoms))]

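os.makedirs('Data', exist_ok=True)  # the question asks for the Data folder; create it if missing
filepath = os.path.join('Data', 'numbers.db')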
with sq3.connect(filepath) as con1:
    randoms.to_sql('randoms', con=con1, if_exists='replace', index=False)
    
# Validate Answer 1
print(randoms.head())
with sq3.connect(filepath) as con2:
    randoms_sq=pd.read_sql('Select * From randoms', con2)

print(randoms_sq.head())

      Rand1     Rand2     Rand3       Date
0  2.332451  1.231290  1.629969 1970-01-01
1  1.900424 -3.776994  4.697971 1971-01-01
2 -1.531468  4.417328  4.086117 1972-01-01
3  3.961985  0.588490  1.388244 1973-01-01
4  3.090142 -2.857340 -0.198687 1974-01-01
      Rand1     Rand2     Rand3                 Date
0  2.332451  1.231290  1.629969  1970-01-01 00:00:00
1  1.900424 -3.776994  4.697971  1971-01-01 00:00:00
2 -1.531468  4.417328  4.086117  1972-01-01 00:00:00
3  3.961985  0.588490  1.388244  1973-01-01 00:00:00
4  3.090142 -2.857340 -0.198687  1974-01-01 00:00:00
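
The second printout shows Date as text (`1970-01-01 00:00:00`) rather than as a datetime. SQLite has no native datetime type, so the values come back as strings unless pandas is asked to convert them. The following is a minimal sketch, not part of the original answer, that uses an in-memory database and a small stand-in dataframe (`demo`, `as_text`, and `roundtrip` are illustrative names). It also wraps the connection in `contextlib.closing`, because the `sqlite3` connection's own `with` block only commits or rolls back the transaction and does not close the connection.

```python
import sqlite3
from contextlib import closing

import numpy as np
import pandas as pd

# Small stand-in for the randoms dataframe (illustrative data only).
demo = pd.DataFrame({
    "Rand1": np.random.normal(1, 2, 3),
    "Date": pd.to_datetime(["1970-01-01", "1971-01-01", "1972-01-01"]),
})

# closing() guarantees con.close(); the inner `with con` commits the write.
with closing(sqlite3.connect(":memory:")) as con:
    with con:
        demo.to_sql("randoms", con=con, if_exists="replace", index=False)
    # Without parse_dates the Date column is read back as text.
    as_text = pd.read_sql("SELECT * FROM randoms", con)
    # parse_dates asks pandas to convert the named column to datetimes.
    roundtrip = pd.read_sql("SELECT * FROM randoms", con, parse_dates=["Date"])

print(as_text["Date"].dtype)
print(roundtrip["Date"].dtype)
```

Passing `parse_dates=["Date"]` is one option; converting afterwards with `pd.to_datetime` on the column would be another.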


2\. Convert the code from #1 above into __two separate functions and add a third function__. All interactions with the database should be done with the __context manager__. Make sure you use the parameter/argument and function names given in the instructions. (30 points)

- The first function is called create_randoms and takes the following as parameters/arguments to create the dataframe:
    - n_obs which is the number of observations
    - n_cols which is the number of columns
    - rand_mean with a __default__ of 0 which is the mean for the random numbers
    - rand_std with a __default__ of 1 which is the standard deviation for the random numbers
    - start_yr which is a __string representation__ of the year that the Date column starts with


- The second function named write_db is called by create_randoms (first function) and takes the following parameters to write out the dataframe to the database:
    - df which is the dataframe to write out to the database
    - table_name which is the name of the table
    - db_name without the .db which is the name of the database
        - (Note: you will need to append the .db before writing out the database)
    - folder_nm with a default of 'Data' which is the name of the folder within the current working directory to write out to.
    - the arguments needed to write to the database can be passed through the create_randoms function as explicitly named parameters OR passed as kwargs for 10 bonus points.
    
    
- The third function named read_db takes the following parameters to read from a sqlite3 database and return it as a dataframe:
    - table_name which is the name of the table
    - database name without the .db which is the name of the database
        - (Note: you will need to append the .db before reading from the database)
    - folder_nm with a __default__ of 'Data' which is the name of the folder within the current working directory to read from.
    - if there is __any__ error in reading the database file, the function should return an empty dataframe.
    - the function should use __context manager__ to read the database

In [3]:
# Answer for Question 2

def write_db(df, table_name, db_name, folder_nm='Data'):
    if not os.path.isdir(folder_nm):         
        os.mkdir(folder_nm)  
    filepath='{}/{}.db'.format(folder_nm, db_name)
    with sq3.connect(filepath) as con1:
        df.to_sql(table_name, con=con1, if_exists='replace', index=False)
        
def create_randoms(n_obs, n_cols, start_yr, rand_mean=0, rand_std=1, **kwargs):
    randoms = pd.DataFrame(data = np.random.normal(rand_mean, rand_std, (n_obs,n_cols))
                   ,columns=['Rand{}'.format(i) for i in range(n_cols)])
    randoms['Date'] = [datetime.datetime(int(start_yr)+i, 1, 1) for i in range(n_obs)]
    write_db(df=randoms, **kwargs)

# Database parameter name was given as db_name so that same kwargs could be used for both methods
def read_db(table_name, db_name, folder_nm="Data"):
    filepath = "{}/{}.db".format(folder_nm, db_name)
    
    try:
        with sq3.connect(filepath) as con2:
            return pd.read_sql('Select * From {}'.format(table_name), con2)
    except Exception:  # any failure while reading yields an empty dataframe
        return pd.DataFrame()
        

# Validate Answer 2
kwargs = {"table_name": "table_test", "db_name": "randoms_test"}
create_randoms(100,4,'1800',**kwargs)  # start_yr is passed as a string, as the question specifies
print(read_db(**kwargs, folder_nm="Data"))


       Rand0     Rand1     Rand2     Rand3                 Date
0   0.371569 -1.450260  0.062045  0.453386  1800-01-01 00:00:00
1  -0.233001  0.321680 -0.242663 -0.137057  1801-01-01 00:00:00
2   0.339116 -1.651602  1.018467  0.530654  1802-01-01 00:00:00
3   0.726967 -0.674302  0.152275  0.342939  1803-01-01 00:00:00
4  -0.452135  0.640686  0.969496  0.498372  1804-01-01 00:00:00
..       ...       ...       ...       ...                  ...
95  0.090756  0.087312  0.344476 -1.125269  1895-01-01 00:00:00
96  1.050141 -1.315305  0.193629 -0.188231  1896-01-01 00:00:00
97 -0.793833  0.241988 -1.536161  0.522117  1897-01-01 00:00:00
98  0.670620 -0.183665  1.865548 -0.467289  1898-01-01 00:00:00
99 -0.035885 -0.041587 -0.779114  1.196515  1899-01-01 00:00:00

[100 rows x 5 columns]
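
The Date column above is built with a list comprehension over `datetime.datetime` objects. As a possible alternative, and only as a sketch under the assumption that a vectorized construction is acceptable for the exam, `pd.date_range` with the year-start frequency `"YS"` produces the same January 1 sequence. The variable names below mirror the question's parameters, and the small `n_obs` is chosen just for illustration.

```python
import pandas as pd

start_yr = "1800"  # string year, as in the create_randoms specification
n_obs = 5          # small illustrative size, not the exam's value

# "YS" means year start: one timestamp per year on January 1.
dates = pd.date_range(start=f"{start_yr}-01-01", periods=n_obs, freq="YS")
print(dates)
```

One caveat: in pandas versions that store timestamps at nanosecond resolution, the generated dates must fall roughly between the years 1677 and 2262.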


#### Test create_randoms calls if using explicitly named pass through parameters
- DO NOT Make Changes. Just run the cells

In [4]:
# test your create_randoms function call

create_randoms(10,4,'1980', rand_mean=1, rand_std=2, table_name='randoms', db_name='numbers2')

In [5]:
# test your create_randoms function call

create_randoms(10,4,'1980', table_name='randoms', db_name='numbers2')

#### Test create_randoms calls if using kwargs (Bonus)
- DO NOT Make Changes. Just run the cells

In [6]:
# test your create_randoms function call

write_params={'table_name':'randoms', 'db_name':'numbers2'}

create_randoms(10,4,'1980', rand_mean=1, rand_std=2,**write_params)

In [7]:
# test your create_randoms function call

create_randoms(10,4,'1980',**write_params)

#### Test read_db calls
- DO NOT Make Changes. Just run the cells

In [8]:
# test your read_db function call

df = read_db('randoms','numbers2')
df.head()

      Rand0     Rand1     Rand2     Rand3                 Date
0  0.118427 -2.017402 -0.515444  0.814014  1980-01-01 00:00:00
1  0.206143  0.217080 -0.561877 -0.355409  1981-01-01 00:00:00
2  0.386567 -1.705196  0.897876  0.811513  1982-01-01 00:00:00
3 -0.276579 -1.482535  1.676595  0.760938  1983-01-01 00:00:00
4 -1.555866  1.076846  1.276294  0.128674  1984-01-01 00:00:00


In [9]:
# test your read_db function call

df = read_db('randoms','numbers3')

print(type(df))
df.head()

<class 'pandas.core.frame.DataFrame'>
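
The call above reads from `numbers3`, a database that was never written, and still returns a DataFrame. Note that `sqlite3.connect` does not raise for a missing file; it opens a new, empty database, so the error that triggers the fallback comes from querying a table that does not exist. The sketch below reproduces that behavior in a temporary folder. `read_table_or_empty` is a hypothetical helper written for this illustration, not the exam's `read_db`.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd

def read_table_or_empty(filepath, table_name):
    """Hypothetical helper: return the table, or an empty DataFrame on any error."""
    try:
        with closing(sqlite3.connect(filepath)) as con:
            return pd.read_sql("SELECT * FROM {}".format(table_name), con)
    except Exception:
        return pd.DataFrame()

with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "numbers3.db")  # this file does not exist yet
    result = read_table_or_empty(missing, "randoms")

print(type(result))
print(result.empty)
```

Checking `result.empty` is a simple way for calling code to tell a failed read from a successful one.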


3\. Read all aspects of the question carefully and follow the directions closely. Do not assume anything about the composition of combined_df. (40 points)

- Load the LaborSheetData.csv file (from ICON) into a new dataframe called <i>ls_df</i>. You can download the file from ICON to your computer and then load it from your computer. OR if using IDAS, you can copy the file from the shared classdata folder to your IDAS drive and load it from there.

- __Your code to load the dataframe__ should indicate the TimeStamp column is to be __parsed as a date and the first row contains headers__. 


- Add a new column called <i>SalesDif</i> to the dataframe using vectorization that is the absolute value of the difference between Projected Sales and Sales.


- Update any entries missing a manager with the value 'None'.


- Add a <i>Month</i> column to ls_df that is the __3-character__ abbreviated text representation of Month (e.g. 'Jan', 'Feb', 'Mar', 'Jun', 'Jul') as a categorical variable.


- Load the web-users.csv file (from ICON) into a new dataframe called <i>wu_df</i>. You can download the file from ICON to your computer and then load it from your computer. OR if using IDAS, you can copy the file from the shared classdata folder to your IDAS drive and load it from there.

- __Your code to load the dataframe__ should indicate the __Date column is to be parsed as a date, the Date column is the index, and the first row contains headers__. 


- Create a new dataframe called <i>ls2_df</i> that has __just__ the date component from ls_df.TimeStamp as the index and __just__ ls_df.SalesDif as the only column. The dataframe should only contain rows from ls_df where the Manager is David H and the Store is 4007. 


- Using the pandas join method, join the two dataframes (wu_df & ls2_df) using the intersection of the date indices and store the result in a new dataframe called <i>combined_df</i>. The combined_df should have the USERS from wu_df as the first column.  The index value from the resulting combined_df will not be unique, so do not be concerned with that.

- Get the average Sales by Month from ls_df and store the result in a __DataFrame__ called __avgs__.

In [10]:
# Answer for Question 3

ls_df = pd.read_csv("LaborSheetData.csv",parse_dates=["TimeStamp"], header=0)
ls_df["SalesDif"] = abs(ls_df['Projected Sales']-ls_df.Sales)
ls_df['Manager'] = np.where(ls_df.Manager.isnull(), "None", ls_df.Manager)
ls_df["Month"] = pd.Categorical(ls_df.TimeStamp.dt.strftime("%b"))  # vectorized 3-letter month abbreviation
     
wu_df = pd.read_csv("web-users.csv",parse_dates=["DATE"], index_col="DATE", header=0)

# Keep only the rows for Store 4007 managed by David H
mask = (ls_df.Store == 4007) & (ls_df.Manager == "David H")
ls2_df = pd.DataFrame(data=ls_df.SalesDif[mask].values,
                      index=ls_df.TimeStamp[mask].dt.strftime('%Y-%m-%d').values)
ls2_df.columns=['SalesDif']
ls2_df.index.name='TimeStamp'

# Indexes of both dataframes were changed to same date format
combined_df = wu_df.join(ls2_df)

avgs = ls_df.groupby(["Month"]).Sales.mean()
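
The preparation steps in Question 3 (absolute difference, filling missing managers, and a categorical month abbreviation) can each be written as a single vectorized expression. The sketch below is illustrative only: `toy` is a made-up stand-in that borrows the column names from the question, since the real LaborSheetData.csv is not reproduced here. It uses `dt.month_name().str[:3]` for the abbreviation, which defaults to English month names, whereas `strftime("%b")` depends on the system locale.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for LaborSheetData.csv (illustrative values only).
toy = pd.DataFrame({
    "TimeStamp": pd.to_datetime(["2019-01-15", "2019-02-20", "2019-02-21"]),
    "Projected Sales": [500.0, 450.0, 300.0],
    "Sales": [520.0, 400.0, 300.0],
    "Manager": ["David H", np.nan, "Ann B"],
})

# Absolute difference, computed column-wise without a Python loop.
toy["SalesDif"] = (toy["Projected Sales"] - toy["Sales"]).abs()

# Replace missing managers with the literal string 'None'.
toy["Manager"] = toy["Manager"].fillna("None")

# First three letters of the English month name, stored as a categorical.
toy["Month"] = pd.Categorical(toy["TimeStamp"].dt.month_name().str[:3])

print(toy)
```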


In [11]:
# Check your answer here. (Do not make any change to this cell. Just run this cell.)
print(type(avgs))

avgs

<class 'pandas.core.series.Series'>


Month
Apr    490.335702
Aug    482.803510
Feb    469.684367
Jan    572.071429
Jul    464.799242
Jun    507.725735
Mar    502.673863
May    499.476950
Sep    474.336513
Name: Sales, dtype: float64
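
The check cell reports a Series, while the question asks for a __DataFrame__. Selecting a single column with `.Sales` after `groupby` yields a Series. As a sketch on made-up data (the `toy` frame and the `avgs_*` names are illustrative), either selecting with a list of columns or calling `to_frame()` on the result gives a DataFrame with the same values.

```python
import pandas as pd

# Made-up data with the same column names as ls_df (illustrative only).
toy = pd.DataFrame({
    "Month": pd.Categorical(["Jan", "Jan", "Feb"]),
    "Sales": [100.0, 200.0, 50.0],
})

# observed=True keeps only the months that actually occur in the data.
avgs_series = toy.groupby("Month", observed=True)["Sales"].mean()

# Option 1: select with a list of columns, so the result stays a DataFrame.
avgs_frame = toy.groupby("Month", observed=True)[["Sales"]].mean()

# Option 2: convert the Series afterwards.
avgs_frame_alt = avgs_series.to_frame()

print(type(avgs_series).__name__, type(avgs_frame).__name__)
```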

In [12]:
# Check your answer here. (Do not make any change to this cell. Just run this cell.)

combined_df

             USERS  SalesDif
2016-01-01   73404       NaN
2016-01-02   66795       NaN
2016-01-03   57185       NaN
2016-01-04  106916       NaN
2016-01-05   99982       NaN
...            ...       ...
2019-10-02  164249       NaN
2019-10-03  159935       NaN
2019-10-04  159628       NaN
2019-10-05  131067       NaN
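
The rows shown above have no SalesDif value. That is what `DataFrame.join` produces with its default `how='left'`, which keeps every row of the left dataframe. It would also happen if one index holds datetimes and the other holds date strings, because such labels never match. The question asks for the intersection of the date indices, which corresponds to `how='inner'`. The sketch below uses made-up stand-ins (`wu_toy`, `ls_toy`) and assumes both indices are kept as datetimes, with `dt.normalize()` dropping the time of day.

```python
import pandas as pd

# Made-up stand-ins for wu_df and ls_df (illustrative values only).
wu_toy = pd.DataFrame(
    {"USERS": [10, 20, 30]},
    index=pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
)
ls_toy = pd.DataFrame({
    "TimeStamp": pd.to_datetime(
        ["2019-01-02 08:00", "2019-01-02 17:30", "2019-01-04 09:00"]
    ),
    "SalesDif": [5.0, 7.0, 9.0],
})

# normalize() sets the time to midnight but keeps a datetime dtype,
# so the new index is comparable with wu_toy's DatetimeIndex.
ls2_toy = ls_toy.set_index(ls_toy["TimeStamp"].dt.normalize())[["SalesDif"]]

# how="inner" keeps only dates present in both indices.
combined_toy = wu_toy.join(ls2_toy, how="inner")
print(combined_toy)
```

Because two rows of `ls_toy` fall on the same date, the joined index is not unique, which matches the note in the question.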
