## BMI 6018 Term Project
### Members: Anwar Alsanea, Ryan Williams, and Md Imdadul Islam

## Presentation (public):

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("89dHp8AwTeo")

## Introduction:

Our term project analyzes the daily gas production and feed of anaerobic [digesters](https://en.wikipedia.org/wiki/Anaerobic_digestion) in wastewater treatment plants (WWTP).
We obtained the data from a local WWTP, the [Central Valley Water Reclamation Facility (CVWRF)](https://www.cvwrf.org/).
![Anaerobic Digester](http://www.e-p.com/file/5dc89542-d2fd-45e6-ba17-1f02d1d36181)

The Central Valley Water Reclamation Facility maintains an anaerobic digestion program to produce methane gas from liquid waste. Methane gas is produced as microbes in the digester break down and feed on the organic material in the liquid waste. The methane gas is produced in trillions of cubic feet (tcf) and is used to power up to 70% of the facility on a given day. We contacted Phil Heck, the assistant plant manager, to see what kind of data the facility tracks on anaerobic digestion. The data we received had been dumped into an Excel file, never to be seen or utilized. As a group, our goal was to export that data from Excel into Python, create a Pandas DataFrame, and make the information more useful to the facility. We hope this resource helps the management team track digester efficiency more easily, model how the feed affects gas production, and determine where improvements and additions could be made to the system.


# Data:
The data are imported into a DataFrame from a CSV file that contains the gas produced in TCF (Trillion Cubic Feet) and the total feed in gallons for January through September 2017 (excluding August).

In [None]:
import pandas as pd
import numpy as np
from ipywidgets import interact, fixed
import matplotlib.pyplot as plt
import seaborn as sns
import importlib
import imdadplot as ip
from aclass import *
from anwarfunctions import *

Importing the data and renaming some columns:

In [None]:
DF = pd.read_csv('CVWRF.csv', header = 1, na_values="NaN")
DF.rename(columns={'Gas produced (TCF)': 'egg1_gasproduced', 'Feed (Gallons)': 'egg1_feed',
                  'Gas produced (TCF).1':'egg2_gasproduced','Feed (Gallons).1':'egg2_feed'}, inplace=True)


Create two columns that combine the gas production and the incoming feed of both egg digesters:

In [None]:
DF['total_gas'] = DF['egg1_gasproduced'] + DF['egg2_gasproduced']
DF['total_feed'] = DF['egg1_feed'] + DF['egg2_feed']
# keep only the days on which digester 1 produced gas
DF = DF.query("egg1_gasproduced > 0")
# to view the data:
DF.head(5)

Creating smaller dataframes for each month:

In [None]:
DF_jan = DF[DF.Month == 'January']
DF_feb = DF[DF.Month == 'February']
DF_march = DF[DF.Month == 'March']
DF_april = DF[DF.Month == 'April']
DF_may = DF[DF.Month == 'May']
DF_june = DF[DF.Month == 'June']
DF_july = DF[DF.Month == 'July']
# there is no August data
DF_sept = DF[DF.Month == 'September']
month_list = [DF_jan, DF_feb, DF_march, DF_april, DF_may, DF_june, DF_july, DF_sept]
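
As an alternative to writing one filter per month, the same split can be sketched with `groupby`. This is only an illustration on made-up numbers; the names `toy` and `frames_by_month` are assumptions and are not part of the project code.

```python
import pandas as pd

# Toy data with the same "Month" column as DF; the values are made up.
toy = pd.DataFrame({
    "Month": ["January", "February", "February", "July"],
    "total_gas": [1.0, 2.0, 3.0, 4.0],
})

# sort=False keeps the months in order of first appearance in the data.
frames_by_month = {month: frame for month, frame in toy.groupby("Month", sort=False)}

print(list(frames_by_month))  # ['January', 'February', 'July']
```

Because the keys come from the data itself, a month cannot be left out of the collection by accident.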

# Class:
Test the CleanDf class on the full-year dataframe:

In [None]:
DD = CleanDf(DF)
DD.__repr__()

Now test the other class, stats_data, which produces the average and total of a specified column:

In [None]:
DDD = stats_data('total_feed', DF)
DDD.__repr__()
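
The `aclass` module is not shown in this notebook. As a rough sketch only, a class in the spirit of `stats_data` could look like the following. The name `ColumnStats`, its `summary()` method, and the toy numbers are assumptions made for illustration, not the project's actual code.

```python
import pandas as pd


class ColumnStats:
    """Hypothetical stand-in for stats_data: average and total of one column."""

    def __init__(self, column, df):
        self.column = column
        self.df = df

    def summary(self):
        # Return labelled (name, value) pairs for the chosen column.
        values = self.df[self.column]
        return [("average", values.mean()), ("total", values.sum())]

    def __repr__(self):
        return f"ColumnStats(column={self.column!r}, rows={len(self.df)})"


# Toy data, not CVWRF measurements.
toy = pd.DataFrame({"total_feed": [100.0, 200.0, 300.0]})
stats = ColumnStats("total_feed", toy)
print(stats.summary())
```

A separate `summary()` method is used here because Python requires `__repr__` and `__str__` to return strings when they are called through `repr()` or `str()`.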

# Using class to create lists and dictionaries:

First, use the class to create a list of averages and totals for the incoming feed and the gas produced in each month:

In [None]:
# list comprehension:
year_feed_stats = [stats_data('total_feed',month).__str__() for month in month_list]
year_gas_produced_stats = [stats_data('total_gas',month).__str__() for month in month_list]

Check what they look like:

In [None]:
year_feed_stats

Next, we build two dictionaries, one for incoming feed and one for gas production, with the months as keys and the average and sum as values:

In [None]:
# make_dictionary is a function that will create a dictionary from two lists:
monthly_feed = make_dictionary(month_list,year_feed_stats)
monthly_gas = make_dictionary(month_list,year_gas_produced_stats)
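
`make_dictionary` lives in `anwarfunctions`, which is not shown here. A minimal sketch of pairing two lists into a dictionary might look like this; the function name `make_dictionary_sketch`, the use of month names as keys, and the toy values are all assumptions.

```python
def make_dictionary_sketch(keys, values):
    """Pair each key with the value at the same position."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    return dict(zip(keys, values))


# Toy inputs: month names as keys, (average, total) pairs as values.
month_names = ["January", "February", "March"]
toy_stats = [(10.0, 310.0), (12.0, 336.0), (15.0, 465.0)]
toy_monthly = make_dictionary_sketch(month_names, toy_stats)

print(toy_monthly["March"])  # (15.0, 465.0)
```

The length check matters because `zip` silently stops at the shorter list, which would drop months without any warning.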

View both dictionaries:

In [None]:
monthly_feed

In [None]:
monthly_gas 

From our analysis of the data, both the highest incoming feed and the highest gas production occurred in March.
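
A result like this can be cross-checked directly in pandas with `groupby`. The sketch below runs on made-up numbers that only mimic the column names of `DF`; the names `toy`, `monthly`, `top_feed_month`, and `top_gas_month` are assumptions for illustration.

```python
import pandas as pd

# Toy data with the same column names as DF; the values are made up.
toy = pd.DataFrame({
    "Month": ["January", "January", "March", "March"],
    "total_feed": [100.0, 120.0, 150.0, 170.0],
    "total_gas": [400.0, 500.0, 700.0, 800.0],
})

# One row per month, with the mean and sum of each column.
monthly = toy.groupby("Month", sort=False)[["total_feed", "total_gas"]].agg(["mean", "sum"])

# Month with the largest total for each quantity.
top_feed_month = monthly[("total_feed", "sum")].idxmax()
top_gas_month = monthly[("total_gas", "sum")].idxmax()
print(top_feed_month, top_gas_month)
```

Replacing `"sum"` with `"mean"` in the two `idxmax` lines ranks the months by daily average instead of monthly total, which can differ when months have different numbers of recorded days.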


## Data visualization:

First, we create a separate dataframe that holds the gas production of both digesters:

In [None]:
DF_1 = DF[["egg1_gasproduced","egg2_gasproduced"]]

The first plot is a histogram that compares the gas production of the two egg digesters:

In [None]:
DF_1.plot(kind="hist", bins=50, color=['green', 'black'],
          alpha=0.5, title="histogram of both egg gas production")

The second plot is a box plot that shows the median as well as the first and third quartiles, to compare egg digesters 1 and 2.

In [None]:
DF_1.plot(kind="box")

Now we use the class stats_data from the module aclass to verify the above findings. The normalized gas production (i.e. gas produced divided by the amount of feed) was 4.9 for egg1 and 4.7 for egg2, which suggests that digester 1 is the more efficient of the two.

In [None]:
egg1_feed = stats_data ('egg1_feed',DF)
egg1_gas = stats_data ('egg1_gasproduced',DF)
egg2_feed = stats_data ('egg2_feed',DF)
egg2_gas = stats_data ('egg2_gasproduced',DF)
normalized_egg1_gasproduction = egg1_gas.__str__()[0][1] / egg1_feed.__str__()[0][1] 
normalized_egg2_gasproduction = egg2_gas.__str__()[0][1] / egg2_feed.__str__()[0][1] 
# egg 1 is the better one!
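
The cell above divides one summary entry for gas by the matching entry for feed. Assuming that entry is the column average (the class itself is not shown), the normalized value is a ratio of averages. The sketch below uses made-up numbers to show how that differs from averaging the daily gas-to-feed ratios; all names in it are illustrative.

```python
from statistics import mean

# Toy daily values, not CVWRF measurements.
gas = [400.0, 500.0, 1200.0]
feed = [100.0, 100.0, 200.0]

# Ratio of averages: average gas divided by average feed.
ratio_of_means = mean(gas) / mean(feed)

# Average of daily ratios: gas per unit of feed each day, then averaged.
mean_of_ratios = mean(g / f for g, f in zip(gas, feed))

print(ratio_of_means, mean_of_ratios)
```

The two numbers generally differ because the ratio of averages weights high-feed days more heavily, so it is worth stating which one a reported efficiency figure refers to.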

Next we created a scatter plot using the DataFrame.plot() function with kind = 'scatter', but the points overlapped too densely to read. We therefore switched to a hexbin plot, created with the DataFrame.plot() function and kind = 'hexbin'.

In [None]:
DF_1.plot(kind='scatter',x='egg1_gasproduced',y='egg2_gasproduced')

In [None]:
DF_1.plot(kind='hexbin',
          x='egg1_gasproduced',
          y='egg2_gasproduced',
          gridsize=25)

The next two scatter plots show gas production versus feed, one for each digester. They were created using DataFrame.plot.scatter(). The plots show that both digesters produce the most gas when the feed is about 100,000 gallons.

In [None]:
# each call draws its own figure; wrapping it in print() only echoes the Axes object
DF.plot.scatter(x="egg1_feed", y="egg1_gasproduced")
DF.plot.scatter(x="egg2_feed", y="egg2_gasproduced")

In [None]:
importlib.reload(ip)

The next plot is an interactive plot created with the function daily_variation_by_month from the module imdadplot.py. The function is passed to interact, which is imported from ipywidgets. daily_variation_by_month takes two keyword arguments: df_glob and month. In this plot we can choose the month that we want to visualize, and the graph shows the daily variation of the gas produced in egg1.

In [None]:
interact(ip.daily_variation_by_month, df_glob=fixed(DF),
         month=["January", "February", "March", "April", "May", "June", "July", "September"])

Finally, we created another interactive plot using the function monthly_boxplot_by_egg from the same module, imdadplot.py. It also relies on the interact function. monthly_boxplot_by_egg takes two keyword arguments: df_glob and egg. In this plot we can choose the egg number from the widget, and the graph shows a box plot of the gas production in each month.

In [None]:
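# select the columns for each digester and give them common names
DF_egg1 = DF[["Month", "Day", "Year", "egg1_gasproduced", "egg1_feed", "total_gas", "total_feed"]].rename(
    columns={'egg1_gasproduced': 'gas', 'egg1_feed': 'feed'})
DF_egg2 = DF[["Month", "Day", "Year", "egg2_gasproduced", "egg2_feed", "total_gas", "total_feed"]].rename(
    columns={'egg2_gasproduced': 'gas', 'egg2_feed': 'feed'})
# label each row with its digester before stacking, so no row count is hard-coded
DF_egg1 = DF_egg1.assign(egg_number=1)
DF_egg2 = DF_egg2.assign(egg_number=2)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
DF3 = pd.concat([DF_egg1, DF_egg2])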

In [None]:
interact(ip.monthly_boxplot_by_egg, df_glob=fixed(DF3), egg=["1","2"])

# Thank you:
This concludes our analysis of the digester data obtained from CVWRF.