# Instructions

This will earn level 2 for **construct**.

Your goal is to build and prepare two ready-to-analyze datasets. You will submit a notebook that describes how you built and prepared each dataset.

* Each finished dataset must be produced from two or more tables.

* At least one must come from an SQLite database, either by merging results from multiple queries or multiple tables.

* You should use at least three different merges and one concatenate.

Your completed datasets should have:

* column names that are well formatted (only lowercase letters, numbers and _)

* an added column that is derived from one or more other columns (string operation or calculation)

For each dataset, pose one question that could not be answered from the input data files as provided and demonstrate how to answer it with the dataset you built. This could be something that can be answered using only the shape of the merged data, but if you need the summarize and visualize level 2 achievements, you should use more statistics and plots.

Your notebooks must be in the top level of the repository, not in a subfolder.

Additional Achievements:

If you have already earned prior achievements, you can ignore the following.

To earn level 2 for **prepare**, one of your analyses must use datasets with missing values and one must be provided as Excel files with merged columns (for example from NCES). You may use one dataset with both merged columns and missing data or one of each. You must also use datasets that have column names that need repair.
For the merged column data, either before or after merging, you must additionally:

* create separate tables for original and aggregate values (e.g. percentages or sums that can be recovered from the other columns)

* unstack all levels of the data to create a single level index over the columns

* create a database with the tables


![](tweet.png)

# Description of Datasets
A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").

Some exploratory ideas on this include:

* when tweets are negative, what topics do travelers tend to discuss?
* when tweets are positive, what are travelers happy with?
* outside of sentiment, what other systematic variation exists in the tweets about different airlines?
* what types of tweets on airlines tend to be retweeted?

[Reference](https://www.kaggle.com/crowdflower/twitter-airline-sentiment)

## Importing Required Libraries

In [1]:
import sqlite3 
import pandas as pd

## Loading Database

If we need to work with larger datasets, one option is to use a computer with more memory. Another option is to load data from a database instead of loading all of it into memory (RAM). One way to do the latter is with the Python library `sqlite3`, which allows us to create a connection between a database and Python.

The database file must be downloaded first, because we need a local copy to connect to and query interactively.

After downloading the database, there are three steps to take before working with its data:
1. Make a connection between the Jupyter notebook (Python environment) and the database.
2. Run queries on the database.
3. Get the results out of the database and work with them in Python.

**Making the connection with the "database" database**

In [2]:
con = sqlite3.connect('data/tweeter/database.sqlite')

### One method for loading data is using a cursor

**Creating Cursor** (only for demonstration)

In [3]:
cur = con.cursor()

In [4]:
cur.execute("SELECT * FROM Tweets")

<sqlite3.Cursor at 0x7f3d6b3bc8f0>

We can make a list of our column names using the cursor's `description` attribute (the first element of each entry is the column name):

In [5]:
colnames = []
for desc in cur.description:
    colnames.append(desc[0])

In [6]:
colnames

['tweet_id',
 'airline_sentiment',
 'airline_sentiment_confidence',
 'negativereason',
 'negativereason_confidence',
 'airline',
 'airline_sentiment_gold',
 'name',
 'negativereason_gold',
 'retweet_count',
 'text',
 'tweet_coord',
 'tweet_created',
 'tweet_location',
 'user_timezone']
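
The loop over `cur.description` can also be written as a list comprehension. The sketch below is only an illustration: it uses an in-memory database with a tiny stand-in `Tweets` table (hypothetical data, not the Kaggle file) so that it runs on its own, and the `demo_` names are made up for this example.

```python
import sqlite3

# In-memory database with a tiny stand-in table (hypothetical data, not the Kaggle file)
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE Tweets (tweet_id INTEGER, airline TEXT)")
demo_con.execute("INSERT INTO Tweets VALUES (1, 'Delta')")

demo_cur = demo_con.cursor()
demo_cur.execute("SELECT * FROM Tweets")

# Each entry of cursor.description is a tuple whose first element is the column name
demo_colnames = [desc[0] for desc in demo_cur.description]
print(demo_colnames)

demo_con.close()
```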

**Fetching all the rows of the query result**

In [7]:
#It returns all the rows as a list of tuples.
data = cur.fetchall()

In [8]:
#showing the first two rows
data[:2]

[(567588278875213824,
  'neutral',
  1,
  '',
  '',
  'Delta',
  '',
  'JetBlueNews',
  '',
  0,
  "@JetBlue's new CEO seeks the right balance to please passengers and Wall ... - Greenfield Daily Reporter http://t.co/LM3opxkxch",
  '',
  '2015-02-16 23:36:05 -0800',
  'USA',
  'Sydney'),
 (567590027375702016,
  'negative',
  1,
  "Can't Tell",
  0.6503,
  'Delta',
  '',
  'nesi_1992',
  '',
  0,
  '@JetBlue is REALLY getting on my nerves !! 😡😡 #nothappy',
  '',
  '2015-02-16 23:43:02 -0800',
  'undecided',
  'Pacific Time (US & Canada)')]

**Making a dataframe using the rows and columns we made in previous steps**

In [9]:
pd.DataFrame(data =data, columns = colnames)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,567588278875213824,neutral,1.0000,,,Delta,,JetBlueNews,,0,@JetBlue's new CEO seeks the right balance to ...,,2015-02-16 23:36:05 -0800,USA,Sydney
1,567590027375702016,negative,1.0000,Can't Tell,0.6503,Delta,,nesi_1992,,0,@JetBlue is REALLY getting on my nerves !! 😡😡 ...,,2015-02-16 23:43:02 -0800,undecided,Pacific Time (US & Canada)
2,567591480085463040,negative,1.0000,Late Flight,0.346,United,,CPoutloud,,0,@united yes. We waited in line for almost an h...,,2015-02-16 23:48:48 -0800,"Washington, DC",
3,567592368451248130,negative,1.0000,Late Flight,1,United,,brenduch,,0,@united the we got into the gate at IAH on tim...,,2015-02-16 23:52:20 -0800,,Buenos Aires
4,567594449874587648,negative,1.0000,Customer Service Issue,0.3451,Southwest,,VahidESQ,,0,@SouthwestAir its cool that my bags take a bit...,,2015-02-17 00:00:36 -0800,"Los Angeles, CA",Pacific Time (US & Canada)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14480,570309308937842688,neutral,0.6869,,,Delta,,Oneladyyouadore,,0,@JetBlue I hope so because I fly very often an...,,2015-02-24 11:48:29 -0800,Georgia,Quito
14481,570309340952993796,neutral,1.0000,,,US Airways,,DebbiMcGinnis,,0,@USAirways is a DM possible if you aren't foll...,,2015-02-24 11:48:37 -0800,Missourah,Hawaii
14482,570309345281486848,positive,0.6469,,,Delta,,jaxbra,,0,@JetBlue Yesterday on my way from EWR to FLL j...,,2015-02-24 11:48:38 -0800,"east brunswick, nj",Atlantic Time (Canada)
14483,570310144459972608,negative,1.0000,Customer Service Issue,1,US Airways,,GAKotsch,,0,@USAirways and when will one of these agents b...,,2015-02-24 11:51:48 -0800,,Atlantic Time (Canada)


If we open a connection in the usual way, we have to close it ourselves at the end of our operations.

In [10]:
con.close()

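If we use the connection in a `with` block, `sqlite3` commits the transaction when the block finishes (or rolls it back on an error). It does not close the connection, so `con` stays open after the block and can still be used in the queries that follow.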

In [11]:
with sqlite3.connect('data/tweeter/database.sqlite') as con:
    cur = con.cursor()
    cur.execute('SELECT * FROM Tweets')
    row = cur.fetchone()


### A second method for accessing data stored in SQLite is using Python and Pandas
We can also combine `sqlite3` (to access the database) with Pandas functions to pull data from the database into a dataframe.

If we only want the rows where the airline column is Delta, we can add a condition (a WHERE clause) to our query.

In [12]:
# Standard SQL uses single quotes for string literals; double quotes denote identifiers
df = pd.read_sql_query("SELECT * FROM Tweets WHERE airline = 'Delta'", con)

In [13]:
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,567588278875213824,neutral,1.0,,,Delta,,JetBlueNews,,0,@JetBlue's new CEO seeks the right balance to ...,,2015-02-16 23:36:05 -0800,USA,Sydney
1,567590027375702016,negative,1.0,Can't Tell,0.6503,Delta,,nesi_1992,,0,@JetBlue is REALLY getting on my nerves !! 😡😡 ...,,2015-02-16 23:43:02 -0800,undecided,Pacific Time (US & Canada)
2,567667301067915264,neutral,1.0,,,Delta,,BritishAirNews,,0,"@JetBlue CEO weighs profits, flyers - @Chronic...",,2015-02-17 04:50:05 -0800,UK,Sydney
3,567671602280923136,positive,1.0,,,Delta,,twinkletaters,,0,@JetBlue Thanks! Her flight leaves at 2 but sh...,"[32.07301184, -81.09362691]",2015-02-17 05:07:11 -0800,"Savannah, GA",
4,567680108002291712,positive,0.6645,,0.0,Delta,,TravellerLukose,,0,@JetBlue No worries. Delay was minor and dealt...,,2015-02-17 05:40:59 -0800,,
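
When the filter value comes from a variable, it can be passed separately from the SQL text through the `params` argument of `pd.read_sql_query`, using the `?` placeholder style of `sqlite3`. This is a sketch against a small in-memory stand-in table (hypothetical data), not the notebook's database.

```python
import sqlite3
import pandas as pd

# Tiny stand-in table so the example runs without the Kaggle file
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE Tweets (tweet_id INTEGER, airline TEXT)")
demo_con.executemany(
    "INSERT INTO Tweets VALUES (?, ?)",
    [(1, "Delta"), (2, "United"), (3, "Delta")],
)

# The ? placeholder is filled from params, so no quoting is needed in the SQL text
demo_df = pd.read_sql_query(
    "SELECT * FROM Tweets WHERE airline = ?", demo_con, params=("Delta",)
)
print(demo_df)

demo_con.close()
```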


We can see the unique airlines in our dataframe as below.

In [14]:
pd.unique(df['airline'])

array(['Delta'], dtype=object)

## **Hint:** At least one must come from an SQLite database, either by merging results from multiple queries or multiple tables.

Here we can see the sentiment confidence for tweets, filtered to one airline and to positive tweets only.

We apply this filter by selecting **tweet_id** and **airline_sentiment_confidence** and restricting the rows, in the WHERE clause of the query, to tweets where the airline is Delta and the sentiment is positive.

In [15]:
df_sentiment_delta_positive = pd.read_sql_query("SELECT tweet_id, airline_sentiment_confidence FROM Tweets WHERE airline = 'Delta' AND airline_sentiment = 'positive'", con)

In [16]:
df_sentiment_delta_positive.head()

Unnamed: 0,tweet_id,airline_sentiment_confidence
0,567671602280923136,1.0
1,567680108002291712,0.6645
2,567724178317402112,0.6429
3,567727329783582720,1.0
4,567727526777073664,1.0


In [17]:
df_sentiment_delta_positive.shape

(544, 2)

Selecting names when the airline is Delta and the tweet location is USA:

In [18]:
df_name_delta_USA = pd.read_sql_query("SELECT tweet_id, name FROM Tweets WHERE airline = 'Delta' AND tweet_location = 'USA'", con)

In [19]:
df_name_delta_USA.head()

Unnamed: 0,tweet_id,name
0,567588278875213824,JetBlueNews
1,567716378681933825,JetBlueNews
2,567724178317402112,JetBlueNews
3,567746576626356225,JetBlueNews
4,567754367860633600,JetBlueNews


In [20]:
df_name_delta_USA.shape

(71, 2)

Selecting the text of tweets when the airline is either Delta or US Airways:

In [21]:
df_text_delta_USAirways = pd.read_sql_query("SELECT tweet_id, text FROM Tweets WHERE airline = 'Delta' OR airline = 'US Airways'", con)

In [22]:
df_text_delta_USAirways.head()

Unnamed: 0,tweet_id,text
0,567588278875213824,@JetBlue's new CEO seeks the right balance to ...
1,567590027375702016,@JetBlue is REALLY getting on my nerves !! 😡😡 ...
2,567643252753694721,@USAirways how's us 1797 looking today?
3,567667301067915264,"@JetBlue CEO weighs profits, flyers - @Chronic..."
4,567670985403285504,@USAirways @AmericanAir How r u supposed to ch...


### Merging queries

**inner merge**

In [23]:
df_merged_delta_inner = pd.merge(df_sentiment_delta_positive, df_name_delta_USA, on = 'tweet_id', how = 'inner')

In [24]:
df_merged_delta_inner.shape

(6, 3)

In [25]:
df_merged_delta_inner.head()

Unnamed: 0,tweet_id,airline_sentiment_confidence,name
0,567724178317402112,0.6429,JetBlueNews
1,567814769077436416,0.6591,JetBlueNews
2,568101669809946624,0.673,JetBlueNews
3,568335451741749248,0.6465,JetBlueNews
4,568924327275401218,0.6544,JetBlueNews


**Left merge**

In [26]:
df_merged_delta_left = pd.merge(df_sentiment_delta_positive, df_name_delta_USA, on = 'tweet_id', how = 'left')

In [27]:
df_merged_delta_left.shape

(544, 3)

In [28]:
df_merged_delta_left.head()

Unnamed: 0,tweet_id,airline_sentiment_confidence,name
0,567671602280923136,1.0,
1,567680108002291712,0.6645,
2,567724178317402112,0.6429,JetBlueNews
3,567727329783582720,1.0,
4,567727526777073664,1.0,


**right merge**

In [29]:
df_merged_delta_right = pd.merge(df_sentiment_delta_positive, df_name_delta_USA, on = 'tweet_id', how = 'right')

In [30]:
df_merged_delta_right.shape

(71, 3)

In [31]:
df_merged_delta_right.head()

Unnamed: 0,tweet_id,airline_sentiment_confidence,name
0,567588278875213824,,JetBlueNews
1,567716378681933825,,JetBlueNews
2,567724178317402112,0.6429,JetBlueNews
3,567746576626356225,,JetBlueNews
4,567754367860633600,,JetBlueNews


**outer merge**

In [32]:
df_merged_delta_outer = pd.merge(df_sentiment_delta_positive, df_name_delta_USA, on = 'tweet_id', how = 'outer')

In [33]:
df_merged_delta_outer.shape

(609, 3)

In [34]:
df_merged_delta_outer.head()

Unnamed: 0,tweet_id,airline_sentiment_confidence,name
0,567671602280923136,1.0,
1,567680108002291712,0.6645,
2,567724178317402112,0.6429,JetBlueNews
3,567727329783582720,1.0,
4,567727526777073664,1.0,
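
The row counts of the four merges are related: the outer merge has 544 + 71 - 6 = 609 rows, that is, left rows plus right rows minus the rows matched by the inner merge. This relation assumes that `tweet_id` is unique within each table. The toy example below (made-up frames, not the tweet data) reproduces the pattern and uses the `indicator=True` option of `pd.merge` to show where each row came from.

```python
import pandas as pd

# Two small made-up frames that share the keys 3 and 4
left = pd.DataFrame({"tweet_id": [1, 2, 3, 4], "confidence": [1.0, 0.66, 0.64, 1.0]})
right = pd.DataFrame({"tweet_id": [3, 4, 5], "name": ["a", "b", "c"]})

# Number of rows produced by each merge type
rows = {
    how: pd.merge(left, right, on="tweet_id", how=how).shape[0]
    for how in ["inner", "left", "right", "outer"]
}
print(rows)

# indicator=True adds a _merge column: left_only, right_only, or both
outer = pd.merge(left, right, on="tweet_id", how="outer", indicator=True)
print(outer["_merge"].value_counts())
```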


**Limiting loaded data**

We can limit the number of rows returned by using the "LIMIT" keyword in the query.

In [35]:
pd.read_sql_query("SELECT * FROM Tweets LIMIT 10",con)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,567588278875213824,neutral,1,,,Delta,,JetBlueNews,,0,@JetBlue's new CEO seeks the right balance to ...,,2015-02-16 23:36:05 -0800,USA,Sydney
1,567590027375702016,negative,1,Can't Tell,0.6503,Delta,,nesi_1992,,0,@JetBlue is REALLY getting on my nerves !! 😡😡 ...,,2015-02-16 23:43:02 -0800,undecided,Pacific Time (US & Canada)
2,567591480085463040,negative,1,Late Flight,0.346,United,,CPoutloud,,0,@united yes. We waited in line for almost an h...,,2015-02-16 23:48:48 -0800,"Washington, DC",
3,567592368451248130,negative,1,Late Flight,1.0,United,,brenduch,,0,@united the we got into the gate at IAH on tim...,,2015-02-16 23:52:20 -0800,,Buenos Aires
4,567594449874587648,negative,1,Customer Service Issue,0.3451,Southwest,,VahidESQ,,0,@SouthwestAir its cool that my bags take a bit...,,2015-02-17 00:00:36 -0800,"Los Angeles, CA",Pacific Time (US & Canada)
5,567594579310825473,negative,1,Bad Flight,0.6707,United,,brenduch,,0,@united and don't hope for me having a nicer f...,,2015-02-17 00:01:07 -0800,,Buenos Aires
6,567595670463205376,negative,1,Late Flight,1.0,United,,CRomerDome,,0,@united I like delays less than you because I'...,,2015-02-17 00:05:27 -0800,"Portland, OR",Pacific Time (US & Canada)
7,567614049425555457,negative,1,Customer Service Issue,0.3545,United,,JustOGG,,0,"@united, link to current status of flights/air...",,2015-02-17 01:18:29 -0800,Tweets = My Opinion,Eastern Time (US & Canada)
8,567617081336950784,negative,1,Customer Service Issue,1.0,Southwest,,mrshossruns,,0,@SouthwestAir you guys there? Are we on hour 2...,,2015-02-17 01:30:32 -0800,,Eastern Time (US & Canada)
9,567617486703853568,negative,1,Customer Service Issue,0.6797,United,,feliciastoler,,0,@united I tried 2 DM it would not go thru... n...,"[0.0, 0.0]",2015-02-17 01:32:09 -0800,New Jersey,Central Time (US & Canada)
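
`LIMIT` returns only the first rows. If the whole table is needed but should not sit in memory at once, `pd.read_sql_query` also accepts a `chunksize` argument and then yields dataframes piece by piece. The sketch below assumes a small in-memory stand-in table with hypothetical values.

```python
import sqlite3
import pandas as pd

# Stand-in table with 10 rows (hypothetical data)
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE Tweets (tweet_id INTEGER, retweet_count INTEGER)")
demo_con.executemany(
    "INSERT INTO Tweets VALUES (?, ?)", [(i, i % 3) for i in range(10)]
)

# With chunksize, read_sql_query returns an iterator of dataframes
total_rows = 0
total_retweets = 0
for chunk in pd.read_sql_query("SELECT * FROM Tweets", demo_con, chunksize=4):
    total_rows += len(chunk)
    total_retweets += int(chunk["retweet_count"].sum())

print(total_rows, total_retweets)
demo_con.close()
```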


## Cleaning dataframes

We can drop rows that contain NaN values.

In [36]:
df_merged_delta_right.dropna(how='any',inplace=True)

In [37]:
df_merged_delta_right.shape

(6, 3)

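We can also set tweet_id as the index of the dataframe. Note that `set_index` returns a new dataframe, so the result below is displayed but not stored.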

In [38]:
df_merged_delta_right.set_index('tweet_id', drop = True)

Unnamed: 0_level_0,airline_sentiment_confidence,name
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1
567724178317402112,0.6429,JetBlueNews
567814769077436416,0.6591,JetBlueNews
568101669809946624,0.673,JetBlueNews
568335451741749248,0.6465,JetBlueNews
568924327275401218,0.6544,JetBlueNews
569899437662863360,1.0,Balhajji
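
Because `set_index` returns a new dataframe by default, the indexed version has to be assigned to a name to be kept. A minimal sketch with a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({"tweet_id": [10, 11], "name": ["a", "b"]})

# Calling set_index without assignment leaves the original frame unchanged
demo.set_index("tweet_id")
unchanged_columns = list(demo.columns)

# Assigning the result keeps the indexed version
demo_indexed = demo.set_index("tweet_id")
print(demo_indexed)
```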


**We can see the average airline_sentiment_confidence for each name**

In [39]:
df_merged_delta_right.groupby('name')['airline_sentiment_confidence'].mean().reset_index()

Unnamed: 0,name,airline_sentiment_confidence
0,Balhajji,1.0
1,JetBlueNews,0.65518


**We can also see how many tweets each name has**

In [40]:
df_merged_delta_right.groupby('name')['airline_sentiment_confidence'].count().reset_index()

Unnamed: 0,name,airline_sentiment_confidence
0,Balhajji,1
1,JetBlueNews,5
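
The mean and the count can also be computed in a single pass with `agg`. This is a sketch on a small made-up frame that reuses the column names from above:

```python
import pandas as pd

demo = pd.DataFrame({
    "name": ["a", "a", "b"],
    "airline_sentiment_confidence": [0.6, 0.8, 1.0],
})

# One groupby, two statistics per name
summary = (
    demo.groupby("name")["airline_sentiment_confidence"]
    .agg(["mean", "count"])
    .reset_index()
)
print(summary)
```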
