# Combining the Datasources
---
**Purpose**: Combine the important sentences from all of our news sources into one dataframe. Later, I will build dataframes for specific categories of interest (such as lives lost, humans affected, and economic effects).

In [5]:
# Libraries 
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re

%matplotlib inline

### Read in the csv files
Read in the csv files of the important sentences. Notice the additional argument `parse_dates`, set to the column "date", which holds the publication date of each article. With this argument, pandas loads that column in datetime format rather than as plain text.

In [6]:
# Read in the cleaned data for each source (the reliefweb data is from AFTER 2006)
# Notice: after the csv name, parse_dates names the column to load as datetime
reliefweb_df = pd.read_csv("../data/reliefweb_clean.csv",
                           parse_dates=["date"])

tribune_df = pd.read_csv("../data/tribune_data_clean.csv",
                         parse_dates=["date"])

times_df = pd.read_csv("../data/payal_clean.csv",
                       parse_dates=["date"])
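To illustrate what `parse_dates` changes, here is a minimal sketch that reads a tiny in-memory csv with and without the argument. The two rows and the `example.com` URLs are made up for illustration and are not project data.

```python
import io
import pandas as pd

# Hypothetical two-row csv standing in for one of the cleaned files
csv_text = (
    "date,source,text\n"
    "2019-10-21,https://example.com/a,first article\n"
    "2019-10-16,https://example.com/b,second article\n"
)

# Without parse_dates the date column is read as plain text
plain_df = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the same column is converted to a datetime dtype
parsed_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(plain_df["date"].dtype)
print(parsed_df["date"].dtype)
```

Having a real datetime column is what makes later steps such as `sort_values("date")` order the rows chronologically.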

In [7]:
# list of all the dataframes which will be combined together
dataframes = [tribune_df, times_df, reliefweb_df]

In [8]:
# combined dataframe
combined_df = pd.concat(dataframes, ignore_index=True)
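The effect of `ignore_index=True` is easiest to see on small frames. The sketch below uses two made-up dataframes (not the project data) and compares the resulting index with and without the argument.

```python
import pandas as pd

# Hypothetical miniature frames, each with its own 0-based index
a_df = pd.DataFrame({"text": ["a1", "a2"]})
b_df = pd.DataFrame({"text": ["b1"]})

# Default: the original row labels are kept, so labels can repeat
kept = pd.concat([a_df, b_df])

# ignore_index=True: rows are renumbered 0..n-1 in the combined frame
renumbered = pd.concat([a_df, b_df], ignore_index=True)

print(list(kept.index))        # [0, 1, 0]
print(list(renumbered.index))  # [0, 1, 2]
```

Unique row labels matter here because the duplicate check that follows reports rows by their index.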

In [9]:
combined_df.head()

Unnamed: 0,date,source,text
0,2019-10-21,https://www.tribuneindia.com/news/punjab/flood...,flood-hit farmers to get 9k-quintal wheat seed...
1,2019-10-16,https://www.tribuneindia.com/news/punjab/debt-...,debt relief likely for flood-hit farmers. ruch...
2,2019-10-14,https://www.tribuneindia.com/news/punjab/no-wh...,"no wheat seed disbursal, farmers livid. aparna..."
3,2019-10-14,https://www.tribuneindia.com/news/punjab/farme...,farmers in flood-affected areas to get free wh...
4,2019-10-11,https://www.tribuneindia.com/news/punjab/busin...,business sinks in mandis of flood-hit lohian. ...


In [10]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1472 entries, 0 to 1471
Data columns (total 3 columns):
date      1472 non-null datetime64[ns]
source    1472 non-null object
text      1472 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 34.6+ KB


## Find duplicated articles
The same article could be published on a different website, or on different days under a different url, so I wanted to make sure that we did not carry duplicated information forward.

In [11]:
# Find duplicated articles based on duplicated text 
dups = combined_df["text"]
combined_df[dups.isin(dups[dups.duplicated()])].sort_values("date")

Unnamed: 0,date,source,text
1313,2012-07-19,https://reliefweb.int/node/513138,southwest monsoon-2012: daily flood situation ...
1310,2012-07-19,https://reliefweb.int/node/512394,southwest monsoon-2012: daily flood situation ...
1090,2015-03-03,https://www.tribuneindia.com/news/community/wh...,wheat crop in moga flattened. kulwinder sandhu...
965,2015-03-03,https://www.tribuneindia.com/news/community/wh...,wheat crop in moga flattened. kulwinder sandhu...
963,2015-04-15,https://www.tribuneindia.com/news/community/ju...,june 30 deadline to clean drains in areas pron...
...,...,...,...
20,2019-10-21,https://www.tribuneindia.com/news/punjab/flood...,flood-hit farmers to get 9k-quintal wheat seed...
322,2019-10-27,https://www.tribuneindia.com/news/punjab/dengu...,"dengue scare in ludhiana, 10-year-old girl die..."
302,2019-10-27,https://www.tribuneindia.com/news/punjab/dengu...,"dengue scare in ludhiana, 10-year-old girl die..."
301,2019-10-29,https://www.tribuneindia.com/news/punjab/hc-ra...,hc raps punjab agri dept officials. saurabh ma...
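The `isin` pattern above selects every row whose text appears more than once, including the first occurrence. As a sketch on made-up data, the same selection can also be written with `duplicated(keep=False)`; the toy dataframe and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 share the same text
toy_df = pd.DataFrame({
    "source": ["url_1", "url_2", "url_3", "url_4"],
    "text": ["flood report", "wheat crop", "flood report", "dengue scare"],
})

texts = toy_df["text"]

# Pattern used in the notebook: texts that occur again later, then all their rows
mask_isin = texts.isin(texts[texts.duplicated()])

# Alternative: keep=False flags every member of a duplicated group
mask_keep_false = toy_df.duplicated(subset="text", keep=False)

print(toy_df[mask_keep_false])
```

On this toy frame, both masks select the two "flood report" rows.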


In [12]:
# Keep the first appearance of each duplicated text
combined_df = combined_df.drop_duplicates(subset="text", keep="first")
# reset_index() renumbers the rows and keeps the old row labels in a new "index" column
combined_df = combined_df.reset_index()
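Because `reset_index()` stores the old row labels in a new "index" column, that column is also written out when the dataframe is saved. If the old labels are not needed, `reset_index(drop=True)` is one option. The sketch below compares the two on a made-up dataframe; it shows an alternative and is not what this notebook does.

```python
import pandas as pd

# Hypothetical toy data: "a" appears twice
toy_df = pd.DataFrame({"text": ["a", "b", "a", "c"]})

# keep="first" retains rows 0, 1 and 3
deduped = toy_df.drop_duplicates(subset="text", keep="first")

# Default reset_index(): old labels move into an "index" column
with_old_index = deduped.reset_index()

# drop=True: old labels are discarded instead
without_old_index = deduped.reset_index(drop=True)

print(list(with_old_index.columns))     # ['index', 'text']
print(list(without_old_index.columns))  # ['text']
```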

In [13]:
# Check that no duplicated texts remain (an empty dataframe is expected)
dups = combined_df["text"]
combined_df[dups.isin(dups[dups.duplicated()])].sort_values("date")

Unnamed: 0,index,date,source,text


The dataframe went from 1472 rows to 1225 rows, so 247 repeated articles were removed.

In [14]:
# check info
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1225 entries, 0 to 1224
Data columns (total 4 columns):
index     1225 non-null int64
date      1225 non-null datetime64[ns]
source    1225 non-null object
text      1225 non-null object
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 38.4+ KB


In [15]:
# Save the combined, de-duplicated dataframe to a csv file
combined_df.to_csv("../data/combined.csv", index = False)
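As a quick sanity check on the save step, the sketch below writes a made-up dataframe to an in-memory buffer with `index=False` and reads it back. The buffer stands in for the csv file, and the toy rows are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical toy data with a datetime column, like the combined dataframe
toy_df = pd.DataFrame({
    "date": pd.to_datetime(["2019-10-21", "2019-10-16"]),
    "text": ["first article", "second article"],
})

# index=False leaves the row labels out of the csv
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# Read it back; dates are stored as text in csv, so parse_dates is needed again
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["date"])

print(list(reloaded.columns))  # ['date', 'text']
```

Without `index=False`, the row labels would be written as an extra unnamed column, which pandas shows as "Unnamed: 0" when the file is read back.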

**Summary**: In this notebook, the cleaned data from each of the news sources was combined into one dataframe, de-duplicated, and saved as a single csv file. Next, this csv file will be used to extract meaningful information for a database of historic floods in Punjab, India.