# DATA 512: Final Project Proposal

## Motivation and Problem Statement
Climate change has been significantly affecting the world since the early 19th century. Scientists have argued that greenhouse gas emissions contribute to global warming. Over the years, we have observed increases in global temperature, depletion of precious natural resources, and severe damage to the environment. Solutions to these problems involve policies on green energy, conservation, and climate change education. Another observable factor is public opinion on climate change. Policies around climate change, or the lack thereof, often indicate that people do not take the threat seriously enough. In fact, a sizable share of the population believes that climate change is a hoax and not a real problem at all. To have a real impact on climate change, it is necessary to educate people on the matter in conjunction with instituting progressive policy.

The topic of climate change interests me for two reasons:

i. I want to see (quantitatively and concretely) how human actions have contributed to changes in the climate.

ii. I want to analyze the "debate" around the existence of climate change.

## Background Work

**Global warming and science:**
Currently, there is a large body of [scientific evidence](https://www.ucsusa.org/climate/science) for the existence of global warming. This [Wikipedia article](https://en.wikipedia.org/wiki/Scientific_consensus_on_climate_change) spells out the threats and effects in detail. Moreover, this evidence points to human activities as the cause of global warming, with greenhouse gas emissions key among them. The effects of global warming and climate change are observable in almost every facet of nature, from rising sea levels and diminishing biodiversity to melting ice caps and extensive pollution. Most international bodies of natural scientists, academics, and social scientists agree that the climate has been changing for a while, and that the primary causes are human activities and industrial development. This consensus is enough to warrant concern about climate change.

**Public opinion on climate change:**
As [this Wikipedia article](https://en.wikipedia.org/wiki/Public_opinion_on_climate_change) suggests, public opinion on climate change concerns the views of adults on the social, environmental, and political facets of climate change. Factors like media coverage, political ideology, education levels, and the demographic makeup of a population can shape how individuals think about climate change.


## Research Questions and Hypotheses

My goal with this project is to address trends and patterns in climate change as well as public opinion around it. To this end, I have divided my analysis into two main questions, each with sub-questions covering both climate change itself and people's perceptions of it.


*   **What factors influence the way people think about climate change?**


1.   Does a state's political affiliation influence how its residents answer certain questions about climate change?

    **Hypothesis: Republican states are likely to have a greater percentage of respondents who doubt climate change and oppose policies to mitigate global warming than Democratic states.**
2.   Do states with higher surface temperatures or sea levels perceive a greater threat from global warming?
3.   What are the renewable energy and fossil fuel consumption patterns of states that perceive global warming as a big threat versus those that don't?

*   **How has surface temperature evolved over the years and what factors might have affected this evolution?**

1.   Do states that derive more of their energy from burning fossil fuels have higher surface temperatures? Do higher temperatures correlate with more air pollution, higher sea levels, etc.?
2.   What can we project the global surface temperature to be in the next 20 years? 30 years?


## Data 
The data I chose for my questions spans several datasets. Their descriptions and terms of use are below.

1. I will use data from [Yale's Climate Perception Survey Study](https://climatecommunication.yale.edu/visualizations-data/ycom-us/), which asked respondents several questions about climate change and their opinions of it, and aggregated respondents who answered 'yes' or 'no' into separate columns, as percentages of the whole survey-taking population, by US county. The limitation of this dataset is that, because it is already aggregated, I will not be able to compare the answers of the same respondent across multiple questions, which would have been interesting to analyze in greater detail. The [Data Download](https://climatecommunication.yale.edu/visualizations-data/ycom-us/) section of this link specifies that the data may be used for research or academic purposes if cited appropriately. This dataset could suffer from **selection bias**, as it only contains the views of people to whom the survey was made available.

2. My second dataset will be the [Berkeley Earth Climate dataset](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data), which presents monthly surface temperatures for cities, states, and countries over the past ~250 years. Because the record goes so far back, the earliest years contain many missing values, which I will need to handle before analysis. This data is released under the [CC BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/).

3. The third dataset I will use is the [MIT Election Lab data](https://electionlab.mit.edu/data), to determine the political affiliations that US counties have had over the years. This will be used in conjunction with the first dataset to assess attitudes across different counties and see whether they vary with political affiliation. This dataset's [terms of use](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ) classify it as public domain (CC0).

4. Additionally, I will use data from the [UN Website](http://data.un.org/Explorer.aspx) to further explore trends in greenhouse gases. I will also be using [EPA's data](https://www.epa.gov/ghgreporting/ghg-reporting-program-data-sets) on fuel emissions by type for a more detailed breakdown. The [Terms of Use](http://data.un.org/Host.aspx?Content=UNdataUse) for the UN Data certify that anyone is free to use, duplicate and redistribute so long as proper attribution is made.

**Ethical considerations:**
Some of the survey data may suffer from selection bias; the data remains usable as long as this is kept in mind. Note that I will be using data gathered from specific individuals (those who voted in elections, those who participated in surveys) to draw general conclusions about entire populations. The results must be interpreted with this background in mind, since the conclusions may not speak for communities as a whole.

## Methodology

I will be using several different kinds of analysis for this exploration.

**Data engineering:** I will aim to combine multiple datasets so that I am able to explore trends, patterns and hypotheses relating to public opinions on climate change. Additionally, I will clean and prepare the data so it is ready for conducting analysis.

**Time series:** I will analyze the existing data for global surface temperature and use a time series analysis to project what the surface temperature might look like 10 or 20 years in the future.
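
As a rough illustration of the projection step, the sketch below fits a straight-line trend to annual temperatures and extrapolates it 10 and 20 years ahead. It is only a baseline under stated assumptions: the temperatures are synthetic, the names `years`, `temps`, and `project` are hypothetical, and the final analysis may well use a more suitable time series model.

```python
import numpy as np

# Synthetic annual mean temperatures in deg C (illustrative values, not real data)
years = np.arange(1990, 2021)
rng = np.random.default_rng(42)
temps = 14.0 + 0.02 * (years - years[0]) + rng.normal(0.0, 0.05, size=years.size)

# Fit temps ~ slope * (year - first_year) + intercept by least squares
slope, intercept = np.polyfit(years - years[0], temps, deg=1)


def project(year):
    """Extrapolate the fitted line to a given calendar year."""
    return slope * (year - years[0]) + intercept


for horizon in (10, 20):
    target = int(years[-1]) + horizon
    print(f"{target}: {project(target):.2f} deg C")
```

A linear trend ignores seasonality and autocorrelation, so it is best read as a sanity check against which a proper time series model can be compared.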

**Hypothesis tests:** I will aim to answer my hypothesis on how political affiliations affect people's beliefs on climate change, using principles of applied statistics. This may involve t-tests or z-tests, in order to tell us with what confidence we can come to certain conclusions about a hypothesis.
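
To make the testing idea concrete, here is a minimal sketch of a two-sample z-test for a difference in means. The helper `two_sample_z_test` and the two groups of county-level percentages are hypothetical and randomly generated; they stand in for the real survey and election data, and a t-test may be the better choice when groups are small.

```python
import numpy as np
from statistics import NormalDist


def two_sample_z_test(a, b):
    """Two-sided z-test for a difference in means, allowing unequal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Standard error of the difference between the two sample means
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = float((a.mean() - b.mean()) / se)
    # Two-sided p-value from the standard normal distribution
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical percentages of respondents who say global warming is happening
rng = np.random.default_rng(0)
dem_counties = rng.normal(loc=72.0, scale=5.0, size=400)
rep_counties = rng.normal(loc=64.0, scale=5.0, size=400)

z, p = two_sample_z_test(dem_counties, rep_counties)
print(f"z = {z:.2f}, p = {p:.3g}")
```

A small p-value would indicate that the observed gap between the two groups is unlikely under the null hypothesis of equal means; it says nothing about why the gap exists.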

**Visualizations:** I will create visualizations that will help people discern various trends and patterns in the data. For instance, if we graph surface temperature against sea level measurements in individual regions over time, we can find the relation between temperatures and sea levels.

 
## Unknowns and Dependencies

Several things could limit my ability to complete the project on time. Firstly, much of the data is messy, and the engineering needed to make it usable is extensive. Datasets from different sources are hard to link, so I need to find common columns in order to answer interesting questions, which might be harder than it seems. Also, much of the UN data is inconsistent in its granularity and in which countries are included for each feature. This might force me to aggregate the data to a coarser granularity, which would affect the analysis. The UN data also has a separate dataset for each feature (N2O, CO2, CH4 emissions), which might make engineering it (joining separate tables, finding common columns) a time-consuming and meticulous process.
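
The linking problem can be sketched as follows, assuming both tables can be keyed on a county FIPS code. Everything here is illustrative: the column names `county_fips` and `winning_party`, and all values, are hypothetical placeholders rather than the real schemas, and the assumption that `GEOID` holds an integer FIPS code for counties would need to be checked against the data.

```python
import pandas as pd

# Hypothetical county-level extracts; column names and values are illustrative only
survey = pd.DataFrame({
    "GEOID": [1001, 1003, 6037],          # assumed: integer FIPS, leading zero lost
    "happening": [60.1, 62.3, 78.9],
})
election = pd.DataFrame({
    "county_fips": ["01001", "01003", "06037"],
    "winning_party": ["REPUBLICAN", "REPUBLICAN", "DEMOCRAT"],
})

# Normalize both keys to 5-character, zero-padded FIPS strings
survey["fips"] = survey["GEOID"].astype(str).str.zfill(5)
election["fips"] = election["county_fips"].astype(str).str.zfill(5)

# validate= raises an error if either side has duplicate keys
merged = survey.merge(election, on="fips", how="inner", validate="one_to_one")
print(merged[["fips", "happening", "winning_party"]])
```

Normalizing the key type before merging avoids silent mismatches such as `1001` versus `"01001"`, which would otherwise drop rows from an inner join.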

# Importing Libraries, Datasets





In [1]:
# Data extraction libraries
import urllib.request

# Data manipulation libraries
import pandas as pd
import numpy as np

# NLP libraries
import string  # list of punctuation
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

# Visualization libraries and configurations
import matplotlib.pyplot as plt
%matplotlib inline
from wordcloud import WordCloud

# System libraries
import os

# Make directories for visualizations and data (no error if they already exist)
os.makedirs("Visualizations", exist_ok=True)
os.makedirs("Data", exist_ok=True)

In [2]:
def download_file(url, fname):
    """Download the resource at `url` and save it locally as `fname`."""
    urllib.request.urlretrieve(url, fname)

In [3]:
url = 'https://www.epa.gov/sites/production/files/2020-11/emissions_by_unit_and_fuel_type_c_d_aa_10_2020.zip'
download_file(url, 'climate_data.zip')

In [4]:
import zipfile
with zipfile.ZipFile('climate_data.zip', 'r') as zip_ref:
    zip_ref.extractall('Data/')

In [5]:
unit_data = pd.read_excel('Data/Emissions by Unit and Fuel Type.xlsx', sheet_name='UNIT_DATA')
fuel_data = pd.read_excel('Data/Emissions by Unit and Fuel Type.xlsx', sheet_name='FUEL_DATA')

In [6]:
industry_type = pd.read_excel('Data/Emissions by Unit and Fuel Type.xlsx', sheet_name='Industry Type')

In [7]:
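# Save the sheet as CSV, then re-read it using the second CSV line as the header
unit_data.to_csv('Data/unit_data.csv', encoding='utf-8')
unit_type = pd.read_csv('Data/unit_data.csv', header=1)
# Note: the preview shows the real column names ('Facility Id', ...) still sit in a data row
unit_type.head(3)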


Unnamed: 0,0,This data was reported to EPA by facilities as of 09/26/2020,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17
0,1,All emissions data is presented in units of me...,,,,,,,,,,,,,,,,
1,2,"See the ""FAQs about this Data"" for additional ...",,,,,,,,,,,,,,,,
2,3,Facility Id,FRS Id,Facility Name,City,State,Primary NAICS Code,Reporting Year,Industry Type (subparts),Industry Type (sectors),Unit Name,Unit Type,Unit Reporting Method,Unit Maximum Rated Heat Input Capacity (mmBTU/hr),Unit CO2 emissions (non-biogenic),Unit Methane (CH4) emissions,Unit Nitrous Oxide (N2O) emissions,Unit Biogenic CO2 emissions (metric tons)
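
The preview above shows that the sheet begins with a few note rows and that the real column names (`Facility Id`, `FRS Id`, ...) appear in a later row. One possible way to handle this, sketched on a toy frame with placeholder values, is to locate the row containing a known header label and promote it to the column names. The frame `raw` and its contents are hypothetical; only the label `Facility Id` comes from the output above.

```python
import pandas as pd

# Toy frame mimicking the layout: note rows first, then the real header row
raw = pd.DataFrame([
    ["(note row 1)", None, None],
    ["(note row 2)", None, None],
    ["(note row 3)", None, None],
    ["Facility Id", "Facility Name", "State"],
    [1000001, "Example Plant A", "WA"],
    [1000002, "Example Plant B", "TX"],
])

# Locate the row whose first cell holds the known header label
header_idx = raw.index[raw[0] == "Facility Id"][0]

# Promote that row to column names and keep only the rows beneath it
tidy = raw.iloc[header_idx + 1:].reset_index(drop=True)
tidy.columns = raw.iloc[header_idx].tolist()
print(tidy)
```

Searching for the label rather than hard-coding a row offset keeps the step working if the number of note rows changes between data releases.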


In [8]:
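# Save the sheet as CSV, then re-read it using the second CSV line as the header
fuel_data.to_csv('Data/fuel_data.csv', encoding='utf-8')
fuel_type = pd.read_csv('Data/fuel_data.csv', header=1)
# Note: the preview shows the real column names ('Facility Id', ...) still sit in a data row
fuel_type.head(3)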


Unnamed: 0,0,This data was reported to EPA by facilities as of 09/26/2020,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,1,All emissions data is presented in units of me...,,,,,,,,,,,,,,,
1,2,"See the ""FAQs about this Data"" for additional ...",,,,,,,,,,,,,,,
2,3,Facility Id,FRS Id,Facility Name,City,State,Primary NAICS Code,Reporting Year,Industry Type (subparts),Industry Type (sectors),Unit Name,General Fuel Type,Specific Fuel Type,Other Fuel Name,Blend Fuel Name,Fuel Methane (CH4) emissions (mt CO2e),Fuel Nitrous Oxide (N2O) emissions (mt CO2e)


## Dataset for survey on climate change

https://climatecommunication.yale.edu/visualizations-data/ycom-us/

In [9]:
# Download the YCOM 2020 survey data (CSV)
url = 'https://us2.mailchimp.com/mctx/clicks?url=https%3A%2F%2Fmcusercontent.com%2F78464048a89f4b58b97123336%2Ffiles%2F08c8624c-7ca5-46dd-be10-d3fb87ba351f%2FYCOM_2020_Data.csv&h=92d4f4c8687dd1f98a8ca6556a84e25da62c9a844167a668d4b046622b23c016&v=1&xid=e1b1332d76&uid=2618106&pool=contact_facing&subject=2020+Downscaling+Data+Downloaders%3A+Subscription+Confirmed'
download_file(url, 'survey_data.csv')

In [10]:
survey_data = pd.read_csv('survey_data.csv', encoding="ISO-8859-1")
survey_data.head(3)

Unnamed: 0,GeoType,GEOID,GeoName,TotalPop,discuss,discussOppose,reducetax,reducetaxOppose,CO2limits,CO2limitsOppose,localofficials,localofficialsOppose,governor,governorOppose,congress,congressOppose,president,presidentOppose,corporations,corporationsOppose,citizens,citizensOppose,regulate,regulateOppose,supportRPS,supportRPSOppose,drilloffshore,drilloffshoreOppose,drillANWR,drillANWROppose,fundrenewables,fundrenewablesOppose,rebates,rebatesOppose,mediaweekly,mediaweeklyOppose,gwvoteimp,gwvoteimpOppose,teachGW,teachGWOppose,priority,priorityOppose,happening,happeningOppose,human,humanOppose,consensus,consensusOppose,worried,worriedOppose,personal,personalOppose,harmUS,harmUSOppose,devharm,devharmOppose,futuregen,futuregenOppose,harmplants,harmplantsOppose,timing,timingOppose,affectweather,affectweatherOppose
0,National,9999,US,249349804,35.44,64.397,67.832,31.194,67.729,31.706,53.699,14.836,52.391,15.356,59.637,18.07,60.207,11.245,69.962,9.828,63.645,10.896,74.545,24.437,64.953,34.012,52.152,47.054,31.626,67.635,85.723,13.531,82.099,17.125,24.907,74.497,56.384,33.203,77.763,22.01,51.604,22.251,71.637,11.665,57.328,32.403,55.253,24.545,63.331,36.663,42.738,47.216,61.195,28.689,65.08,21.882,70.528,18.096,71.234,18.504,56.261,43.739,64.198,5.892
1,State,1,Alabama,3765888,27.973,71.887,62.843,35.206,62.69,36.574,50.262,15.043,48.997,15.608,53.025,21.209,54.827,13.532,64.191,10.574,58.065,11.876,70.433,28.104,58.386,40.98,62.556,35.986,35.529,62.866,82.285,16.686,79.634,19.105,19.152,80.114,48.467,40.401,75.113,24.47,44.841,27.475,63.347,14.144,50.909,37.789,45.095,27.334,55.752,44.084,37.439,49.158,54.608,31.356,58.001,24.78,60.989,22.537,62.641,21.698,50.425,49.575,53.326,7.127
2,State,2,Alaska,552380,39.532,59.61,64.15,34.16,64.217,35.151,49.862,16.004,45.292,18.031,55.131,22.345,55.47,14.728,68.648,9.489,61.751,9.927,71.899,26.583,59.217,39.29,52.474,46.826,31.7,67.161,85.767,13.191,82.04,16.802,26.879,72.302,56.101,32.978,75.946,23.675,46.416,26.556,70.159,13.81,53.769,35.449,54.596,27.633,61.018,38.567,39.119,52.09,59.708,31.641,64.119,24.607,68.425,20.811,71.07,20.788,52.617,47.383,60.916,5.94
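
As a first look at how the survey columns might be used, the sketch below keeps only state-level rows and computes a net agreement score for one question. It uses the three rows shown above, restricted to four columns, and assumes that `happening` and `happeningOppose` hold the percentages of respondents who agree and disagree that global warming is happening. The column `happening_net` is a hypothetical derived measure.

```python
import pandas as pd

# Three rows copied from the preview above, restricted to a few columns
survey_sample = pd.DataFrame({
    "GeoType": ["National", "State", "State"],
    "GeoName": ["US", "Alabama", "Alaska"],
    "happening": [71.637, 63.347, 70.159],
    "happeningOppose": [11.665, 14.144, 13.81],
})

# Keep state-level rows only; the national row would distort state comparisons
states = survey_sample[survey_sample["GeoType"] == "State"].copy()

# Net agreement: percentage agreeing minus percentage disagreeing
states["happening_net"] = states["happening"] - states["happeningOppose"]

ranked = states.sort_values("happening_net", ascending=False)
print(ranked[["GeoName", "happening_net"]])
```

The same filter on `GeoType` can select other geographic levels present in the full file.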


In [11]:
url = 'https://us2.mailchimp.com/mctx/clicks?url=https%3A%2F%2Fmcusercontent.com%2F78464048a89f4b58b97123336%2Ffiles%2F93be55e0-f134-40ac-9160-e2fba94fc5fc%2FYCOM_2020_Metadata.csv&h=361a72545b27dc1cc971aec46896d156d5de0f74bfbbd5ac15f0ab4acfdd7041&v=1&xid=5e381a15de&uid=2618106&pool=contact_facing&subject=2020+Downscaling+Data+Downloaders%3A+Subscription+Confirmed'
download_file(url, 'survey_metadata.csv')

survey_metadata = pd.read_csv('survey_metadata.csv', encoding="ISO-8859-1")
survey_metadata.head(3)

Unnamed: 0,YCOM.VARIABLE.NAME,VARIABLE.DESCRIPTION
0,GeoType,Geographic level
1,GEOID,Geographic abbreviation
2,GeoName,Geographic name
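
The metadata table can double as a lookup from column name to description, which helps when labeling plots or reading the wide survey table. The sketch below rebuilds the three rows shown above and wraps them in a small helper; `describe` and `survey_metadata_sample` are hypothetical names.

```python
import pandas as pd

# The three metadata rows shown above
survey_metadata_sample = pd.DataFrame({
    "YCOM.VARIABLE.NAME": ["GeoType", "GEOID", "GeoName"],
    "VARIABLE.DESCRIPTION": [
        "Geographic level",
        "Geographic abbreviation",
        "Geographic name",
    ],
})

# Map each variable name to its description
descriptions = (
    survey_metadata_sample
    .set_index("YCOM.VARIABLE.NAME")["VARIABLE.DESCRIPTION"]
    .to_dict()
)


def describe(column):
    """Return the description for a survey column, or a placeholder if unknown."""
    return descriptions.get(column, "(no description found)")


print(describe("GeoName"))
```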
