# <center>Week 2 Assignment</center>

## Part 1

As suggested in Use Case 1 of the FTE, we need to read the other ten stock files and insert their data into the database. Reading ten files by hand isn't an undue burden, but we are going to look ahead to a time when we may have 100 log files to read, and implement this code using the DRY Principle. DRY stands for

* Don't 
* Repeat
* Yourself

So, while we could create an individual cell to read each stock file, for this assignment, do it in one. I'll give you some help to get started.


<hr>

_Reference:_ <br>
https://en.wikipedia.org/wiki/Don%27t_repeat_yourself

Much of computer science and programming is about identifying and exploiting patterns. In this case, we know there is a pattern to how the files are named, and we know (through inspection) that there is a pattern to the columns of data inside. 

The file names all start with a year, from 2009 straight to 2019 with no breaks:

```
2009_aapl_data.xlsx
2010_aapl_data.xlsx
2011_aapl_data.xlsx
2012_aapl_data.xlsx
2013_aapl_data.xlsx
2014_aapl_data.xlsx
2015_aapl_data.xlsx
2016_aapl_data.xlsx
2017_aapl_data.xlsx
2018_aapl_data.xlsx
2019_aapl_data.xlsx
```

Do we know of anything in Python capable of generating a **range** of numbers like that?

In [None]:
for x in range(2009, 2020):
    print(x)

That should be enough to get you going. Feel free to use code and helper functions from the FTE. I gave you the .ipynb file for a reason :).
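To make the hint concrete, here is a minimal sketch of how the year range could be turned into file names. The helper name `stock_file_paths` and the `data` directory are assumptions for illustration, not part of the FTE.

```python
def stock_file_paths(first_year=2009, last_year=2019, data_dir="data"):
    """Build one path per year (hypothetical helper)."""
    # range() excludes its stop value, hence last_year + 1
    return [f"{data_dir}/{year}_aapl_data.xlsx"
            for year in range(first_year, last_year + 1)]


for path in stock_file_paths():
    print(path)
```

Each generated path can then be handed to whatever function reads a single workbook, so the reading code is written only once.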

Remember, 

* If you ran the FTE code, it will have already read in 2009 and created the database file. 
* If you start at 2009 after having already run the FTE code, you will need to handle the table already existing.
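One way to handle a table that already exists is to check for it and start from a clean slate. The sketch below uses the standard `sqlite3` module; the table name `stocks` and the helper names are placeholders, and dropping the table is only one possible choice because it discards the rows already loaded.

```python
import sqlite3


def table_exists(conn, table):
    """Return True if a table with this exact name is listed in sqlite_master."""
    row = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None


def reset_table(conn, table="stocks"):
    """Drop the table if present, then recreate it empty (hypothetical schema)."""
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(
        f'CREATE TABLE "{table}" '
        "(date TEXT, close REAL, volume REAL, open REAL, high REAL, low REAL)"
    )
    conn.commit()
```

The alternative is `CREATE TABLE IF NOT EXISTS`, which keeps the existing rows; in that case the loop should skip 2009 so its rows are not inserted twice.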

It is your choice whether or not to use the dataset library. As with many things in life, there are tradeoffs -- some things are easier, some not. 

**Deliverable:**

When you are done, you will have a database table with slightly more than 2500 rows in it. **Show this by doing a query that counts rows in the table.**
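A row count can be obtained with a single `SELECT COUNT(*)` query. The sketch below assumes plain `sqlite3`; the helper `count_rows` and the throwaway `demo` table are illustrative only, and the real call would use the actual database file and table name.

```python
import sqlite3


def count_rows(conn, table):
    """Return the number of rows in the given table."""
    # Table names cannot be bound as ? parameters, so the name is quoted into the SQL text.
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Demo against an in-memory table holding three 2009 closing prices.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (date TEXT, close REAL)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("2009-12-31", 30.1046), ("2009-12-30", 30.2343), ("2009-12-29", 29.8714)],
)
print(count_rows(conn, "demo"))  # prints 3
```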

In [1]:
import os
import dataset
from openpyxl import load_workbook, Workbook
import sqlite3
import pandas as pd

In [2]:
path = os.getcwd()                                      # get the current working directory
files = os.listdir('data')                              # list everything in the data directory
files_xlsx = [f for f in files if f.endswith('.xlsx')]  # keep only the files that end with .xlsx
wb = load_workbook('data/2009_aapl_data.xlsx')          # load the first workbook explicitly to retain its header; the rest are looped later
ws = wb.active                                          # grab the active worksheet in that workbook
sheet = wb['Sheet']                                     # fetch the worksheet named 'Sheet'
sheet['A1'].value

'date'

In [3]:
print("Rows: ", sheet.max_row)

Rows:  83


In [4]:
print("Columns: ", sheet.max_column)

Columns:  6


In [5]:
last_point = ws.cell(row=sheet.max_row + 1, column=1).coordinate  # coordinate of the first cell below the last data row
# print("First cell below the last data row:", last_point)

In [6]:
files_xlsx_path = []                       # empty list to store the file paths to loop through
for file in files_xlsx:                    # loop through the .xlsx files found in the data directory
    if file != '2009_aapl_data.xlsx':      # 2009 is the active workbook, so it is not added to the list
        file = './data/' + file            # prefix './data/' so the file has the correct relative path
        files_xlsx_path.append(file)       # append ./data/file to files_xlsx_path

In [7]:
for f in files_xlsx_path:
    print(f + ' is being processed now')
    wb2 = load_workbook(f)
    ws2 = wb2['Sheet']
    for value in ws2.iter_rows(min_row=2, values_only=True):  # skip the header row
        ws.append(value)

./data/2014_aapl_data.xlsx is being processed now
./data/2013_aapl_data.xlsx is being processed now
./data/2012_aapl_data.xlsx is being processed now
./data/2015_aapl_data.xlsx is being processed now
./data/2017_aapl_data.xlsx is being processed now
./data/2010_aapl_data.xlsx is being processed now
./data/2018_aapl_data.xlsx is being processed now
./data/2019_aapl_data.xlsx is being processed now
./data/2011_aapl_data.xlsx is being processed now
./data/2016_aapl_data.xlsx is being processed now


In [10]:
print("Rows: ", sheet.max_row)

Rows:  2517


In [9]:
# delete any blank rows left between the first sheet and the appended sheets
blank_rows = [x for x in range(1, ws.max_row + 1)  # scan every row, including the last one
              if ws.cell(x, 1).value is None]      # a row counts as blank when column A is empty

for row_idx in reversed(blank_rows):               # delete bottom-up so earlier indexes stay valid
    ws.delete_rows(idx=row_idx, amount=1)          # delete the blank row

In [11]:
saveName = 'Master.xlsx'                                     # name of the combined workbook
full_file = os.path.abspath(os.path.join('data', saveName))  # absolute path of the save file
wb.save(full_file)                                           # save the workbook to that path

## Part 2

Now that you have a working database with a reasonable amount of data in it, do some queries with it and show the data:

1. Find all days where the stock closed lower than 25. 
    * Print a count of how many
    * Print the first 5 rows found
2. Find all days in 2017 where the stock closed above 35.
    * Print a count of how many
    * Print the last 5 found.
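The two queries could be sketched as follows with `sqlite3`. This is only one reading of the task: the helper names are hypothetical, the table name `aplsp` and the text `date` column that starts with the year are taken from the cells further down, and "first" and "last" are interpreted as ordering by date.

```python
import sqlite3


def closes_below(conn, threshold, table="aplsp"):
    """Return (count, first five rows by date) for days that closed below threshold."""
    where = f'FROM "{table}" WHERE close < ?'
    count = conn.execute(f"SELECT COUNT(*) {where}", (threshold,)).fetchone()[0]
    first_five = conn.execute(
        f"SELECT * {where} ORDER BY date ASC LIMIT 5", (threshold,)
    ).fetchall()
    return count, first_five


def closes_above_in_year(conn, year, threshold, table="aplsp"):
    """Return (count, last five rows by date) for days in `year` that closed above threshold."""
    where = f'FROM "{table}" WHERE date LIKE ? AND close > ?'
    params = (f"{year}%", threshold)
    count = conn.execute(f"SELECT COUNT(*) {where}", params).fetchone()[0]
    # Take the newest five rows, then flip them back into chronological order.
    last_five = conn.execute(
        f"SELECT * {where} ORDER BY date DESC LIMIT 5", params
    ).fetchall()[::-1]
    return count, last_five
```

With an open connection these would be called as `closes_below(conn, 25)` and `closes_above_in_year(conn, 2017, 35)`. Note that the comparison operators follow the task wording: `<` for "lower than" and `>` for "above".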
    
**Deliverable:**

3. Create a new workbook and put each query result on a new worksheet in the workbook. Remember to save it to disk.
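A minimal sketch of this deliverable with `openpyxl` is shown below. The helper `save_query_results` and the shape of its `results` argument (sheet title mapped to a header list and a list of row tuples) are assumptions chosen for illustration.

```python
from openpyxl import Workbook


def save_query_results(results, path):
    """Write each query result to its own worksheet and save the workbook.

    `results` maps a sheet title to a (header, rows) pair.
    """
    wb = Workbook()
    default_sheet = wb.active                 # Workbook() always starts with one empty sheet
    for title, (header, rows) in results.items():
        ws = wb.create_sheet(title=title)     # one worksheet per query
        ws.append(header)                     # column names in the first row
        for row in rows:
            ws.append(row)                    # one worksheet row per result row
    if results:
        wb.remove(default_sheet)              # drop the unused default sheet
    wb.save(path)
```

Sheet titles are limited to 31 characters in Excel, so short names such as `below_25` are the safer choice.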


In [12]:
def isfloat(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

# Note: These have to be tested in the right order. isfloat() reports True for integers.
def get_type(value):
    if value.isdigit():
        return dataset.types.Integer
    elif isfloat(value):
        return dataset.types.Float
    elif '/' in value:
        return dataset.types.Date
    else:
        return dataset.types.Uni

In [23]:
apl = pd.read_excel('./data/Master.xlsx')  # read Master.xlsx into a pandas data frame

In [24]:
apl['date'] = pd.to_datetime(apl['date'], format='%Y/%m/%d')  # convert the date column to the datetime64 type

In [75]:
apl.dtypes

date      datetime64[ns]
close            float64
volume           float64
open             float64
high             float64
low              float64
dtype: object

In [59]:
conn = sqlite3.connect('apl.db')
c = conn.cursor()

In [None]:
def create_table():
    c.execute('CREATE TABLE IF NOT EXISTS Aplsp '
              '(date TEXT, close FLOAT, volume FLOAT, open FLOAT, high FLOAT, low FLOAT)')

In [None]:
create_table()

In [79]:
# testing adding data with a dynamic loop
# def dynamic_data_entry(df):
#     for row in df.itertuples():
#         date = row.date
#         close = row.close
#         value = row.volume
#         openP = row.open
#         high = row.high
#         low = row.low
#         c.execute("INSERT INTO Aplsp (date, close, volume, open, high, low) VALUES(?, ?, ?, ?, ?, ?)",
#                  (date, close, value, openP, high, low))
#         conn.commit()

----------------------------------------
2009-12-31 00:00:00 30.1046 87907426.0
----------------------------------------
2009-12-30 00:00:00 30.2343 102705781.0
----------------------------------------
2009-12-29 00:00:00 29.8714 110755363.0
----------------------------------------
2009-12-28 00:00:00 30.23 160784168.0
----------------------------------------
2009-12-24 00:00:00 29.8628 125222058.0
----------------------------------------
2009-12-23 00:00:00 28.8714 86118086.0
----------------------------------------
2009-12-22 00:00:00 28.6228 87148416.0
----------------------------------------
2009-12-21 00:00:00 28.3186 152166116.0
----------------------------------------
2009-12-18 00:00:00 27.9186 151863506.0
----------------------------------------
2009-12-17 00:00:00 27.4086 96720359.0
----------------------------------------
2009-12-16 00:00:00 27.8614 88203036.0
----------------------------------------
2009-12-15 00:00:00 27.7386 104742851.0
-----------------------------------
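The commented-out loop inserts one row per `execute` call. As a hedged alternative, the same insert can be handed to `executemany` in a single transaction. The helper name `insert_rows` is hypothetical; the column list is the one used in the cell above.

```python
import sqlite3


def insert_rows(conn, rows, table="aplsp"):
    """Insert an iterable of (date, close, volume, open, high, low) tuples."""
    sql = (
        f'INSERT INTO "{table}" (date, close, volume, open, high, low) '
        "VALUES (?, ?, ?, ?, ?, ?)"
    )
    with conn:  # commits once on success, rolls back if an insert fails
        conn.executemany(sql, rows)
```

For a data frame, the rows could come from `df.itertuples(index=False, name=None)` after the `date` column has been converted to text.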

In [60]:
try:
    apl.to_sql('aplsp', conn)
except ValueError as v_error:
    print(v_error)

Table 'aplsp' already exists.
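The "already exists" message comes from the default `if_exists='fail'` behaviour of `DataFrame.to_sql`. A minimal sketch of an alternative is shown below, assuming that replacing the table on every run is acceptable; the wrapper name `write_frame` is hypothetical.

```python
import sqlite3

import pandas as pd


def write_frame(df, conn, table="aplsp"):
    """Write the data frame to SQLite, replacing any existing table of that name."""
    # if_exists="replace" drops and recreates the table instead of raising ValueError;
    # index=False keeps the data frame's row index out of the table.
    df.to_sql(table, conn, if_exists="replace", index=False)
```

Passing `if_exists="append"` instead would keep the existing rows and add the new ones, which duplicates data if the same frame is written twice.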


In [28]:
cursor = conn.execute("select name from sqlite_master where type = 'table';")  # list the names of all tables

In [29]:
type(cursor)

sqlite3.Cursor

In [30]:
#fetch all tables in our database
cursor.fetchall()

[('aplsp',)]

In [31]:
cursor = conn.execute('select * from aplsp;')  # select every row; cursor.description then holds the column names

In [32]:
#list of columns in our database
for col in cursor.description:
    print(col[0])

index
date
close
volume
open
high
low


In [62]:
conn.execute('select * from aplsp where close > 25 ORDER BY close asc limit 5').fetchall()

[(75, '2009-09-15 00:00:00', 25.0228, 103726871.0, 24.8628, 25.0928, 24.7986),
 (63, '2009-10-01 00:00:00', 25.8371, 130857059.0, 26.4786, 26.6028, 25.8143),
 (74, '2009-09-16 00:00:00', 25.9814, 187437586.0, 25.4271, 26.1071, 25.4114),
 (67, '2009-09-25 00:00:00', 26.0528, 111258383.0, 26.0014, 26.5, 25.92),
 (68, '2009-09-24 00:00:00', 26.26, 137146491.0, 26.7428, 26.8143, 26.11)]

In [43]:
# store the previous query's result in `data` so the rows can be printed tab-separated
data = conn.execute('select * from aplsp where close > 25 ORDER BY close asc limit 5').fetchall()
for line in data:
    print('\t'.join(str(field) for field in line))


75	2009-09-15 00:00:00	25.0228	103726871.0	24.8628	25.0928	24.7986
63	2009-10-01 00:00:00	25.8371	130857059.0	26.4786	26.6028	25.8143
74	2009-09-16 00:00:00	25.9814	187437586.0	25.4271	26.1071	25.4114
67	2009-09-25 00:00:00	26.0528	111258383.0	26.0014	26.5	25.92
68	2009-09-24 00:00:00	26.26	137146491.0	26.7428	26.8143	26.11


In [63]:
conn.execute("select count(*) from aplsp where date like ('2017%') and close < 35").fetchall()

[(0,)]

In [64]:
c.close()
conn.close()

In [66]:
# reconnect to the same database through the dataset library
db = dataset.connect("sqlite:///apl.db")

In [67]:
db.tables

['aplsp']

In [68]:
db['aplsp'].columns

['index', 'date', 'close', 'volume', 'open', 'high', 'low']

In [71]:
high_close = db['aplsp'].find(close = {'>=': 30})
for i, row in enumerate(high_close):
    if i < 10:
        print(f"{row['date']} {row['close']} {row['volume']} {row['open']} {row['high']} {row['low']}")

2009-12-31 00:00:00 30.1046 87907426.0 30.4471 30.4786 30.08
2009-12-30 00:00:00 30.2343 102705781.0 29.8328 30.2857 29.7586
2009-12-28 00:00:00 30.23 160784168.0 30.2457 30.5643 29.9444
2014-12-31 00:00:00 110.38 41304780.0 112.82 113.13 110.21
2014-12-30 00:00:00 112.52 29798660.0 113.64 113.92 112.11
2014-12-29 00:00:00 113.91 27533430.0 113.79 114.77 113.7
2014-12-26 00:00:00 113.99 33681200.0 112.1 114.52 112.01
2014-12-24 00:00:00 112.01 14479610.0 112.58 112.71 112.01
2014-12-23 00:00:00 112.54 25991030.0 113.23 113.33 112.46
2014-12-22 00:00:00 112.94 45097060.0 112.16 113.49 111.97
