# Comparing Public School Funding and Mortality Rates Due to Drug Abuse and Self-Harm In Ohio
>**CS3900 Final Project**

>Jonathon Gebhardt, Brian Duffy, Alexander Silcott

<img src="header.jpg" />
Source: https://www.1and1.com/digitalguide/online-marketing/web-analytics/what-is-machine-learning-how-machines-learn-to-think/

## Introduction
In recent years there has been an increase in drug-related deaths, especially in Ohio. Many factors can contribute to a spike in this type of mortality, including economic and environmental ones, and a person's sense of self-worth is often shaped by those same factors. Individuals with a better school experience might be less likely to engage in behaviors that end in death by overdose or suicide than individuals with a poor school experience (e.g. no art or after-school programs, a poor classroom environment).

**We hope to address the following questions:**
- Is there a relationship between the amount of funding public schools get and the mortality rates of individuals in those areas due to drug abuse or self-harm?

- Can we predict if a change to a school district’s budget will have an effect on the mortality rate of a particular area?

- Are there specific areas schools can spend money on that could reduce these mortality rates and perhaps improve general public health as a result?

## Files included

### Python scripts
- **csvconvert.py** - Convert given xlsx file to csv.
- **preprocess.py** - Preprocess csv files. Find intersection and complement of our datasets so we can build a combined dataset using a foreign key.
- **trim.py** - Trim off extra columns we don't need, to reduce the number of features in the data.

### Datasets and other files
- **jupyter-notebook.ipynb** - Jupyter notebook to present information.
- **grad.csv** - Graduation information about school districts. Used to cross-reference IRNs to get collated data.
- **district.csv** - School expenditure information for Ohio.
- **mort.csv** - Average mortality rate by county. (Deaths per 100,000)
- **expanded.csv** - All of the schools combined, before finding intersection and complement.
- **expanded_complement.csv** - All of the schools not in the intersection.
- **expanded_intersection.csv** - Combined dataset before trimming columns.
- **expanded_intersection_trimmed.csv** - Final combined dataset after trimming columns.
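
The intersection, complement, and foreign-key join described above can be sketched with pandas. This is a minimal illustration and not the project's actual preprocess.py: the three small tables are toy stand-ins for district.csv, grad.csv, and mort.csv, every value in them is made up, and the column names are assumed to match the feature names used later in this notebook.

```python
import pandas as pd

# Toy stand-ins for the real files; every value here is made up for illustration.
districts = pd.DataFrame({
    "IRN": [101, 102, 103],
    "Operating Expenditures": [1_000_000, 2_500_000, 750_000],
})
irn_to_county = pd.DataFrame({
    "IRN": [101, 102],
    "County Name": ["Alpha", "Beta"],
})
mortality = pd.DataFrame({
    "County Name": ["Alpha", "Beta"],
    "County Mortality Rate": [10.0, 20.0],
})

# Left-join on IRN; the indicator column records which rows found a match.
expanded = districts.merge(irn_to_county, on="IRN", how="left", indicator=True)

# Intersection: districts whose IRN appears in both tables.
intersection = expanded[expanded["_merge"] == "both"].drop(columns="_merge")

# Complement: districts with no matching IRN.
complement = expanded[expanded["_merge"] == "left_only"].drop(columns="_merge")

# Attach the county-level mortality rate to each matched district.
combined = intersection.merge(mortality, on="County Name", how="left")
print(combined)
```

Keeping the complement as its own table makes it easy to inspect which districts would be lost before they are dropped from the combined dataset.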

### csvconvert.py
**To begin, we created a script, csvconvert.py, which converts our input xlsx files into csv files.**
It is run from the command line and takes three arguments: the file name of the xlsx workbook, the name of the desired sheet, and the output file name for the csv. In this Python-friendly form, we can read and manipulate the data with scikit-learn and Jupyter.

In [None]:
# %load csvconvert.py
#!/usr/bin/env python3

'''

Source: https://stackoverflow.com/questions/20105118/convert-xlsx-to-csv-correctly-using-python

'''

import xlrd  # note: xlrd 2.0+ reads only .xls files; .xlsx input needs xlrd < 2.0
import csv
import argparse
import os.path

def csv_from_excel():
    wb = xlrd.open_workbook('excel.xlsx')
    sh = wb.sheet_by_name('Sheet1')
    your_csv_file = open('your_csv_file.csv', 'w')
    wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)

    for rownum in range(sh.nrows):
        wr.writerow(sh.row_values(rownum))

    your_csv_file.close()

# runs the csv_from_excel function:
#csv_from_excel()

parser = argparse.ArgumentParser()
parser.add_argument("file_name", help="name of excel workbook to be converted")
parser.add_argument("sheet_name", help="name of excel workbook sheet to be converted")
parser.add_argument("output_file_name", help="name of output file")
args = parser.parse_args()

if os.path.isfile(args.file_name):
    wb = xlrd.open_workbook(args.file_name)
    sh = wb.sheet_by_name(args.sheet_name)

    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    with open(args.output_file_name, 'w', newline='') as your_csv_file:
        wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)

        # Copy every row of the sheet into the csv
        for rownum in range(sh.nrows):
            wr.writerow(sh.row_values(rownum))

else:
    print("Error: File " + args.file_name + " does not exist")
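
To make the three positional arguments concrete, here is a small sketch of the same argparse setup with an explicit argument list passed in place of the real command line. The helper `build_parser` and the input name `district.xlsx` are hypothetical and used only for illustration.

```python
import argparse


def build_parser():
    """Build a parser with the same three positional arguments as csvconvert.py."""
    parser = argparse.ArgumentParser(
        description="Convert one sheet of an Excel workbook to csv")
    parser.add_argument("file_name", help="name of excel workbook to be converted")
    parser.add_argument("sheet_name", help="name of excel workbook sheet to be converted")
    parser.add_argument("output_file_name", help="name of output file")
    return parser


# Equivalent to running: python3 csvconvert.py district.xlsx Sheet1 district.csv
# (district.xlsx is a hypothetical input file name)
args = build_parser().parse_args(["district.xlsx", "Sheet1", "district.csv"])
print(args.file_name, args.sheet_name, args.output_file_name)
```

Building the parser inside a function also means the module could be imported without argparse trying to read the command line.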


### preprocess.py

**preprocess.py builds a dictionary of IRNs keyed by county and checks every school district against it.**
This confirms that we found a match for each district's identifying IRN, which lets us cross-reference multiple datasets.

In [37]:
# %load preprocess.py
#!/usr/bin/env python3

import csv
import pandas as pd
import numpy as np

def checkIntersection():
    # Collect the IRN of every district that appears in some county's IRN list
    matched = []
    for row in districtInfo:
        irn = int(float(row[1]))
        for irns in irnLookups.values():
            if irn in irns:
                matched.append(irn)
                break


    # Report every district whose IRN was not found
    for row in districtInfo:
        if int(float(row[1])) not in matched:
            print('Did not match', row[1], 'to irnLookups dictionary')


    print('Matched', len(matched), 'out of', len(districtInfo), 'schools from districtInfo to irnLookups')


# Print each county name followed by its list of IRNs
def printLookups(irn_list):
    for county, irns in irn_list.items():
        print(county, irns)


# grad.csv is what helps us map IRNs to county names
# Build IRN dictionary: (String)county name: (List)IRN
irnLookups = {}
with open('grad.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        # Column 2 is county name
        # Column 0 is IRN
        if row[2] not in irnLookups:
            # Skip the header row, whose county field is the literal 'County'
            if row[2] != 'County':
                irnLookups[row[2]] = [int(row[0])]
        else:
            irnLookups[row[2]].append(int(row[0]))


# mort.csv is all of the mortality data and is categorized by
# county name, which is why we had to relate IRNs to county
# names.
# Build mortality
mortality = []
line = 0
with open('mort.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        # Ohio lines are between 2081 and 2170 (both exclusive)
        if 2081 < line < 2170:
            mortality.append(row)

        # If we get to line 2170, we're done
        if line >= 2170:
            break

        line += 1


# Begin parsing district.csv, which contains various information
# on schools by district
districtInfo = []
with open('district.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        districtInfo.append(row)


# Trim off first two rows which are column names and empty space
districtInfo = districtInfo[2:]

# load a csv
df = pd.read_csv('grad.csv')

In [43]:
df

Unnamed: 0,District IRN,District Name,County,Region,Address,City and Zip Code,Phone #,Superintendent,Four Year Longitudinal Graduation Rate - Class of 2016,Letter Grade of 4 year Graduation Rate 2016,Four Year Graduation Rate Numerator - Class of 2016,Four Year Graduation Rate Denominator - Class of 2016,Four Year Graduation Rate - Similar District Average,Four Year Graduation Rate - Statewide Average,Watermark
0,442,Manchester Local,Adams,Region 14,130 Wayne Frye Dr,"Manchester, OH, 45144-9314",(937) 549-4777,Brian E. Rau,95.4,A,62,65,90.6,83.6,
1,43489,Akron City,Summit,Region 8,70 N Broadway St,"Akron, OH, 44308-1911",(330) 761-1661,David W. James,74.3,F,1185,1595,74.8,83.6,
2,43497,Alliance City,Stark,Region 9,200 Glamorgan St,"Alliance, OH, 44601-2946",(330) 821-2100,Jeffery S. Talbert,88.6,C,179,202,82.7,83.6,
3,43505,Ashland City,Ashland,Region 7,PO Box 160,"Ashland, OH, 44805-0160",(419) 289-1117,Douglas J. Marrah,94.2,A,226,240,92.6,83.6,
4,43513,Ashtabula Area City,Ashtabula,Region 5,2630 W 13th St,"Ashtabula, OH, 44004-2405",(440) 992-1201,Melissa D. Watson,78.6,F,217,276,84.8,83.6,
5,43521,Athens City,Athens,Region 16,25 S Plains Rd,"The Plains, OH, 45780-1333",(740) 797-4516,Thomas J. Gibbs,89.1,B,179,201,93.6,83.6,
6,43539,Barberton City,Summit,Region 8,479 Norton Ave,"Barberton, OH, 44203-1737",(330) 753-1025,Jeffrey M. Ramnytz,86.7,C,247,285,86.8,83.6,
7,43547,Bay Village City,Cuyahoga,Region 3,377 Dover Center Rd,"Bay Village, OH, 44140-2304",(440) 617-7300,Clinton L. Keener,98.3,A,172,175,96.8,83.6,
8,43554,Beachwood City,Cuyahoga,Region 3,24601 Fairmount Blvd,"Beachwood, OH, 44122-2298",(216) 464-2600,Robert P. Hardis,97.6,A,120,123,96.3,83.6,
9,43562,Bedford City,Cuyahoga,Region 3,475 Northfield Rd,"Bedford, OH, 44146-2201",(440) 439-1500,Andrea Celico,84.0,C,247,294,85.0,83.6,


### Pull in checkIntersection definition

In [26]:
from preprocess import checkIntersection

### Validate intersection between two data sets
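Here we can see that every IRN in the district expenditure data (district.csv) has a match in the IRN lookup built from the _Graduation Rates_ data (grad.csv), so no district is lost when we relate districts to the county-level _Mortality_ dataset.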

In [44]:
checkIntersection()

Matched 607 out of 607 schools from districtInfo to irnLookups
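
The same check can be expressed with set operations, which also yields the unmatched IRNs directly. This is a sketch and not the code in preprocess.py: the helper `split_by_match` is hypothetical, the lookup holds only a small subset of the real IRNs, and 99999 is a made-up IRN included to show an unmatched case.

```python
def split_by_match(district_irns, irn_lookups):
    """Return (matched, unmatched) IRNs as sorted lists."""
    # Flatten every county's IRN list into one set for fast membership tests
    known = set()
    for irns in irn_lookups.values():
        known.update(irns)

    district_set = set(district_irns)
    matched = sorted(district_set & known)    # intersection
    unmatched = sorted(district_set - known)  # complement
    return matched, unmatched


# Small subset of the county -> IRN lookup; 99999 is a made-up IRN
irn_lookups = {"Adams": [442, 61903], "Athens": [43521, 44446]}
district_irns = [442, 43521, 99999]

matched, unmatched = split_by_match(district_irns, irn_lookups)
print("Matched", len(matched), "out of", len(district_irns))
print("Unmatched:", unmatched)
```

Using a set avoids rescanning every county's list for each district, which matters little at this size but keeps the intent clear.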


### Build a dictionary for the IRN lookups and print them to the screen
We do this by reading grad.csv and importing the printLookups definition.

In [45]:
import csv

# Pull in printLookups definition
from preprocess import printLookups

# Build IRN dictionary: (String)county name: (List)IRN
irnLookups = {}
with open('grad.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        # Column 2 is county name
        # Column 0 is IRN
        if row[2] not in irnLookups:
            # Skip the header row, whose county field is the literal 'County'
            if row[2] != 'County':
                irnLookups[row[2]] = [int(row[0])]
        else:
            irnLookups[row[2]].append(int(row[0]))

# Print list of matching IRNs         
printLookups(irnLookups)

Adams [442, 61903]
Summit [43489, 43539, 43836, 44552, 44834, 44883, 49973, 49981, 49999, 50005, 50013, 50021, 50039, 50047, 50054, 50062, 50070]
Stark [43497, 43711, 44354, 44503, 49833, 49841, 49858, 49866, 49874, 49882, 49890, 49908, 49916, 49924, 49932, 49940, 49957]
Ashland [43505, 45468, 45823, 45831]
Ashtabula [43513, 43810, 44057, 45856, 45864, 45872, 45880]
Athens [43521, 44446, 45906, 45914, 45922]
Cuyahoga [43547, 43554, 43562, 43612, 43646, 43653, 43786, 43794, 43901, 43950, 43976, 44040, 44198, 44305, 44370, 44529, 44545, 44636, 44701, 44750, 44792, 44842, 45005, 45062, 45286, 46557, 46565, 46573, 46581, 46599, 46607]
Belmont [43570, 44347, 45203, 45237, 45997, 46003, 46011]
Logan [43588, 48074, 48082, 48090]
Huron [43596, 44560, 45096, 47712, 47720, 47738, 47746]
Washington [43604, 44321, 50484, 50492, 50500, 50518]
Franklin [43620, 43802, 44073, 44800, 44933, 45047, 45070, 45138, 46946, 46953, 46961, 46979, 46995, 47001, 47019, 47027]
Wood [43638, 45583, 45609, 50674, 50
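
Because mortality is reported per county while expenditures are reported per district, it can help to invert this dictionary so that each IRN maps to its county. The sketch below is an illustration only: it reads three rows in the column order of grad.csv (IRN, district name, county) from an in-memory string, and the names `county_to_irns` and `irn_to_county` are our own.

```python
import csv
import io
from collections import defaultdict

# Three rows in the column order of grad.csv: IRN, district name, county
sample = io.StringIO(
    "District IRN,District Name,County\n"
    "442,Manchester Local,Adams\n"
    "43489,Akron City,Summit\n"
    "43539,Barberton City,Summit\n"
)

reader = csv.reader(sample)
next(reader)  # skip the header row

# defaultdict(list) creates the empty list the first time a county is seen
county_to_irns = defaultdict(list)
for row in reader:
    county_to_irns[row[2]].append(int(row[0]))

# Invert the mapping: one entry per IRN, pointing at its county
irn_to_county = {
    irn: county
    for county, irns in county_to_irns.items()
    for irn in irns
}
print(irn_to_county[43539])
```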

# What features can we eliminate from the dataset?

We have more candidate features than we really need. Some of these columns contain virtually no data for most schools, so we can eliminate them immediately because of their zero values. This narrows the set of features we have to select from.

**Adult Ed** - Although there might be some kind of correlation we can draw from this feature, there are many rows in which this has no data.

**Instr Equipment** - Many rows with no data for this field; not relevant.

**Land & Structures** - Many rows with no data for this field; not relevant.
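
One way to find such columns is to measure the share of zero or missing rows per column. The sketch below uses a toy frame with made-up values, and the 50% cutoff is an assumption chosen for illustration, not a threshold used in this project.

```python
import pandas as pd

# Toy expenditure table; all values are made up for illustration
df = pd.DataFrame({
    "Instruction": [500.0, 620.0, 480.0, 710.0],
    "Adult Ed": [0.0, 0.0, 12.0, 0.0],
    "Land & Structures": [0.0, None, 0.0, 0.0],
})

# Treat missing values as zero, then compute the fraction of zero rows per column
empty_share = (df.fillna(0) == 0).mean()

# Assumed cutoff: flag columns where more than half of the rows carry no data
threshold = 0.5
sparse_columns = empty_share[empty_share > threshold].index.tolist()

print(empty_share)
print(sparse_columns)
```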

### Other features that can be eliminated?

**Community Service** - We feel that this could play a role, but unfortunately there are too many rows with zeros for this field.

**Construction** - Not relevant to topic. Not all schools had expenditures in construction at this time.

**Debt & Interest** - Not relevant to topic. Although this will have an impact on school spending, it most likely does not have impact on topic.

**Enterprise** - Not relevant to topic. Expenditures may be too broad to really have influence on topic.

**Food Service** - This feature is probably not relevant to our topic.

**Org Type** - This feature can be removed because all of our datapoints are public districts and this column is redundant.

**Other Equipment** - This covers non-instructional equipment expenditures; however, many rows have zero values, so the data is inconsistent.

**Pupil Transp** - This feature is probably not relevant to our topic.

**Weighted ADM** - This is the weighted average daily membership. It is important for the dataset but not a necessary part of our process, as it was already used to determine the values in the rows of the dataset.
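
Dropping the features listed above can be sketched as a single pandas call. This is an illustration, not the contents of trim.py: it assumes the column headers in the csv match the feature names used in this section, and the toy frame (including the `Org Type` value) is made up.

```python
import pandas as pd

# Features eliminated above; the exact csv headers are assumed to match these names
DROPPED = [
    "Adult Ed", "Instr Equipment", "Land & Structures", "Community Service",
    "Construction", "Debt & Interest", "Enterprise", "Food Service",
    "Org Type", "Other Equipment", "Pupil Transp", "Weighted ADM",
]


def trim_columns(df, columns=DROPPED):
    """Return a copy of df without the listed columns."""
    # errors="ignore" skips any listed name that is not present in df
    return df.drop(columns=columns, errors="ignore")


# Toy frame with made-up values
toy = pd.DataFrame({
    "IRN": [1, 2],
    "Instruction": [5.0, 6.0],
    "Adult Ed": [0.0, 0.0],
    "Org Type": ["Public District", "Public District"],
})

trimmed = trim_columns(toy)
print(list(trimmed.columns))
```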

# Feature Index

Now that we have reduced the number of features to use in our algorithm, we will define them as follows:

**CRI - Classroom Instr** - Classroom instructional cost. Actual amount spent on classroom instructional purposes.

**CRI%** - Percent spent on classroom instructional purposes.

**County Mortality Rate** - Average mortality rate (deaths per 100,000) for drug overdose in 2014 in the school's county.

**County Name** - Name of the county that the school belongs to.

**Gen Admin** - General Administration. Expenditure for board of education and executive administration (office of superintendent) services.

**IRN** - Information Retrieval Number. Identification number assigned to educational entity. We use this ID to compare our data across multiple datasets.

**Instr Staff Sup** - Instructional staff support services. Expenditures for supervision of instruction service improvements, curriculum development, instructional staff training, academic assessment, and media, library, and instruction-related technology services.

**Instruction** - Activities dealing with the interaction of teachers and students in the classroom, home, or hospital as well as co-curricular activities. Includes teachers and instructional aides or assistants engaged in regular instruction, special education, and vocational education programs. Excludes adult education programs.

**Local Education Agency Name** - Name of the entity/school, used for identification purposes.

**NCR -Nonclassroom** - Nonclassroom expenditures. This includes general administration, school administration, other and non-specified support services, operation and maintenance of plant, pupil transportation, and Elem-Sec noninstructional food service.

**NCR%** - Percent spent on nonclassroom expenditures.

**Non-Operating** - Non-Operating expenditures. The sum of enterprise operations, non-instructional--Other, community services, adult education, non-elementary-secondary programs--Other, construction, land and existing structures, equipment (instructional and other), and payment to other governments and interest on debt.

**Oper & Maint** - Operation and maintenance of plant. Expenditure for buildings services (electricity, heating, air, insurance), care and upkeep of grounds and equipment, nonstudent transportation and maintenance; security devices.

**Operating EPEP*** - Operating expenditures per equivalent pupil. Amount spent per pupil on operating cost.

**Operating Expenditures** - Cost of instruction and support services, as well as administration and pupil transportation and food services. We left out a few of these metrics and believe we can use this value in lieu of them.

**Other Elem-Sec** - Other Elementary-secondary Noninstructional. Expenditure for other elementary-secondary non-instructional activities not related to food services or enterprise operations.

**Other Non Elem-Sec** - Other Nonelementary-secondary Programs. All other nonelementary-secondary programs such as any post-secondary programs for adults.

**Other Support** - Other and Non-specified Support Services. Business support expenditures for fiscal services (budgeting, receiving and disbursing funds, payroll, internal auditing, and accounting), purchasing, warehousing, supply distribution, printing, publishing, and duplicating services. Also includes central support expenditures for planning, research and development, evaluation, information, management services, and expenditures for other support services not included elsewhere.

**Pupil Support** - Pupil support Services. Expenditures for administrative, guidance, health, and logistical support that enhance instruction. Includes attendance, social work, student accounting, counseling, student appraisal, information, record maintenance, and placement services. Also includes medical, dental, nursing, psychological, and speech services.

**School Admin** - School Administration. Expenditure for the office of the principal services.

### *Note about EPEP
EPEP (Expenditure per Equivalent Pupil) is similar to EPP (Expenditure Per Pupil). EPP treats all pupils as equal, whereas EPEP applies a weighting to make it more representative of the students actually attending the district (i.e. it takes into account students who attend multiple schools in the school year).

Source: http://education.ohio.gov/Topics/Finance-and-Funding/Finance-Related-Data/Expenditure-and-Revenue/Expenditure-Per-Pupil-Rankings

Source: https://education.ohio.gov/getattachment/Topics/Data/Report-Card-Resources/Financial-Data/Technical-Guidance-Finance.pdf.aspx

Source: http://www.tccsa.net/sites/tccsa.net/files/files/EMIS_Forms/Acronyms-EMIS.pdf