## Hands-On Data Preprocessing in Python
Learn how to effectively prepare data for successful data analytics
    
## Data Fusion and Data Integration

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Data Fusion versus Data Integration
**Data integration** is the process of retrieving heterogeneous data and combining it into a unified form and structure. It lets users, organizations, and applications merge different kinds of data (such as datasets, documents, and tables) for use in personal or business processes and functions.
<br></br>
**Data fusion** is the process of bringing together data from multiple sources in order to build more sophisticated models and understand more about a project. It often means gathering data about a single subject from several sources and combining it for central analysis.
<br></br>

<center><img src="https://drive.google.com/uc?id=1DDSaTJ0sGTDwAqBYEzOXR4d6ITZTXvyV" width="500"/></center>


**Data integration example**: Imagine that a company would like to analyze how effective its advertising is. The company needs to come up with two columns of data: the total sales per customer and the total advertising expenditure per customer. Because the sales and marketing departments each keep and manage their own databases, each department is tasked with creating a list of customers with the relevant information. Once they've done that, they need to connect the data of each customer from the two sources. This connection relies on the existence of real customers, so no assumptions and no changes to the data are needed to make it.
<br></br>
**Data fusion example**: Imagine a technology-empowered farmer who would like to see the influence of irrigation (water dispersion) on yield. The farmer has data on both the amount of water the revolving water stations have dispensed and the amount of harvest from each point on the farm. Each stationary water station has a sensor that calculates and records the amount of water dispensed. Also, each time the blade in the combine harvester moves, the machine calculates and records the amount of harvest and the location.
<br></br>
In the previous example, the clear connection between the sources was the definition of the data objects: customers. In this example, there is no such connection between the sources of data, so we need to make assumptions and change the data so that a connection is possible.
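
The contrast between the two examples can be sketched with toy data. Everything below is hypothetical: the column names, the coordinates, and in particular the "nearest station" rule, which is just one possible assumption a farmer might make to connect the sources.

```python
import numpy as np
import pandas as pd

# --- Integration: both sources describe the same data objects (customers) ---
# Hypothetical per-customer tables; the column names are illustrative.
sales = pd.DataFrame({"customer_id": [1, 2, 3], "total_sales": [250.0, 90.0, 410.0]})
marketing = pd.DataFrame({"customer_id": [1, 2, 3], "ad_spend": [20.0, 5.0, 35.0]})
integrated = sales.merge(marketing, on="customer_id")  # no assumptions needed

# --- Fusion: the sources share no data object definition ---
# Hypothetical station positions (metres) with water dispensed, and harvest points.
stations = pd.DataFrame({"station": ["A", "B"], "x": [0.0, 100.0],
                         "y": [0.0, 0.0], "water": [500.0, 300.0]})
harvest = pd.DataFrame({"x": [10.0, 90.0, 40.0], "y": [5.0, -5.0, 0.0],
                        "yield_kg": [12.0, 8.0, 11.0]})

# Assumption: each harvest point is watered only by its nearest station.
dx = harvest["x"].to_numpy()[:, None] - stations["x"].to_numpy()[None, :]
dy = harvest["y"].to_numpy()[:, None] - stations["y"].to_numpy()[None, :]
nearest = np.hypot(dx, dy).argmin(axis=1)
harvest["station"] = stations["station"].to_numpy()[nearest]

# Change the data: aggregate harvest per station so the sources can be connected.
fused = (harvest.groupby("station", as_index=False)["yield_kg"].sum()
         .merge(stations[["station", "water"]], on="station"))

print(integrated)
print(fused)
```

The integration half needs only a shared key, while the fusion half has to redefine the data object (here, one row per station) before any merge is possible.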

## Frequent Challenges of Data Fusion and Integration
**Challenge 1 –  entity identification**

The challenge is that the data sources describe the same real-world entities with the same definition of data objects, but the data objects are not easy to connect because the sources do not share a unique identifier.

For instance, in the data integration example section, the sales department and the marketing department did not use a central customer unique identifier for all their customers. Due to this lack of data management, when they want to integrate the data, they will have to figure out which customer is which in the data sources.
<br></br>
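
One common way to approach entity identification is to build a matching key from an attribute both sources happen to record. The sketch below assumes, purely for illustration, that both departments store a customer email; the tables and column names are hypothetical.

```python
import pandas as pd

# Hypothetical department tables with unrelated identifiers.
sales = pd.DataFrame({"sales_id": ["S-01", "S-02"],
                      "email": ["Ann.Lee@Example.com ", "bob@example.com"],
                      "total_sales": [250.0, 90.0]})
marketing = pd.DataFrame({"mkt_id": [77, 78],
                          "email": ["ann.lee@example.com", " BOB@example.com"],
                          "ad_spend": [20.0, 5.0]})

def normalize(s):
    """Strip whitespace and lowercase so equivalent emails compare equal."""
    return s.str.strip().str.lower()

sales["key"] = normalize(sales["email"])
marketing["key"] = normalize(marketing["email"])

# indicator=True adds a _merge column showing which rows found a partner.
matched = sales.merge(marketing, on="key", how="outer", indicator=True)
print(matched[["sales_id", "mkt_id", "key", "_merge"]])
```

Rows whose `_merge` value is not `both` are the customers that still need to be identified by other means.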

**Challenge 2 –  unwise data collection**

This data integration challenge happens, as its name suggests, due to unwise data collection. For instance, instead of using a centralized database, the data of different data objects is stored in multiple files.
<br></br>
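
When the data is scattered across several files with the same layout, a loop plus `pd.concat()` is often enough to bring it together. The sketch below uses in-memory stand-ins for the files; the file names and columns are hypothetical.

```python
import io
import pandas as pd

# Stand-ins for separate CSV files; in practice these would be paths on disk.
files = {
    "store_a.csv": io.StringIO("date,sales\n2016-01-01,10\n2016-01-02,12\n"),
    "store_b.csv": io.StringIO("date,sales\n2016-01-01,7\n"),
}

frames = []
for name, handle in files.items():
    part = pd.read_csv(handle)
    part["source"] = name  # remember which file each row came from
    frames.append(part)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```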

**Challenge 3 –  index mismatched formatting**
<img src="https://drive.google.com/uc?id=1LvBE81XfzA9BUqJFFOAepnwPfQM9XELM" width="700"/>
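
The figure above is not reproduced in code. As a hypothetical illustration of what a formatting mismatch between indexes can look like, the sketch below uses two series whose indexes hold the same dates written in different formats.

```python
import pandas as pd

# Same two days, written in two different formats (illustrative values).
a = pd.Series([1.0, 2.0], index=["1/5/2016", "1/6/2016"], name="a")
b = pd.Series([10.0, 20.0], index=["2016-01-05", "2016-01-06"], name="b")

# Aligning on the raw strings finds no matches: four rows, each half empty.
naive = pd.concat([a, b], axis=1)

# Converting both indexes to a common type makes the rows line up.
a.index = pd.to_datetime(a.index, format="%m/%d/%Y")
b.index = pd.to_datetime(b.index, format="%Y-%m-%d")
aligned = pd.concat([a, b], axis=1)

print(naive)
print(aligned)
```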

**Challenge 4 –  aggregation mismatch**

This challenge occurs when we're integrating data sources by adding attributes. It arises, for instance, when the sources are time series whose time intervals are not identical.

<img src="https://drive.google.com/uc?id=1-XCAVYTohQaw9YbBG1hrrqxj1eL2OjEy" width="500"/>
<br></br>
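
A minimal sketch of resolving an aggregation mismatch, assuming both sources carry a proper datetime index: the finer series is aggregated up to the coarser interval before the two are combined. The values are made up.

```python
import pandas as pd

# Hypothetical 15-minute consumption readings covering two hours.
idx_15 = pd.date_range("2016-01-01 00:00", periods=8, freq="15min")
consumption = pd.Series([1.0] * 8, index=idx_15, name="Consumption")

# Hypothetical hourly temperature readings for the same two hours.
idx_60 = pd.date_range("2016-01-01 00:00", periods=2, freq="60min")
temp = pd.Series([3.5, 4.0], index=idx_60, name="temp")

# Aggregate the finer series up to hourly intervals, then combine.
hourly = consumption.resample("60min").sum()
merged = pd.concat([temp, hourly], axis=1)
print(merged)
```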

**Challenge 5 –  duplicate data objects**

This challenge occurs when we're integrating data sources by <font color='blue'>adding data objects</font>. If some data objects appear in more than one source, the integrated dataset will contain duplicates of those data objects.

For example, imagine a hospital that provides different kinds of healthcare services. For a project, we need to gather the socioeconomic data of all of the patients in the hospital. The imaginary hospital does not have a centralized database, so all of the departments are tasked with returning a dataset containing all the patients they have provided services for. After integrating all of the datasets from different departments, you should expect that there are multiple rows for the patients that had to receive care from different departments in the hospital.
<br></br>
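
The hospital scenario can be sketched as follows. The department tables, the `patient_id` column, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical department extracts; patient 102 visited both departments.
cardiology = pd.DataFrame({"patient_id": [101, 102], "income_band": ["low", "mid"]})
oncology = pd.DataFrame({"patient_id": [102, 103], "income_band": ["mid", "high"]})

# Integrating by adding data objects stacks the rows of each source.
stacked = pd.concat([cardiology, oncology], ignore_index=True)
n_dupes = stacked.duplicated(subset="patient_id").sum()

# Keep one row per patient.
patients = stacked.drop_duplicates(subset="patient_id").reset_index(drop=True)
print(n_dupes)
print(patients)
```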

**Challenge 6 – data redundancy**

Unlike the previous challenge, this challenge may be faced when you're integrating data sources by <font color='blue'>adding attributes</font>. After data integration, some of the attributes may be redundant. This redundancy could be shallow, where two attributes have different titles but the same data. Or, it could be deeper. In deeper data redundancy cases, the redundant attribute does not have the same title, nor is its data the same as one of the other attributes, but the values of the redundant attribute can be derived from the other attributes.
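
Shallow redundancy is the easier of the two to detect automatically. The sketch below uses a hypothetical table in which `Height` repeats `height_cm` (shallow) and `bmi` is derivable from the other attributes (deeper); only the shallow case is flagged by the check.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0],
    "Height": [150.0, 160.0, 170.0],   # shallow: same data, different title
    "weight_kg": [50.0, 64.0, 72.25],
})
# Deeper: different title and different data, but derivable from other attributes.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Identical columns appear as duplicated rows of the transposed frame.
is_repeat = df.T.duplicated()
redundant_cols = is_repeat[is_repeat].index.tolist()
print(redundant_cols)
```

Deeper redundancy such as `bmi` usually requires domain knowledge, or checks like correlation analysis, rather than a simple equality test.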


### Example 1 (Challenges 3 & 4)
---
In this example, we have two sources of data. The first was retrieved from the local electricity provider that holds the electricity consumption (**Electricity Data 2016_2017.csv**), while the other was retrieved from the local weather station and includes temperature data (**Temperature 2016.csv**). We want to see if we can come up with a visualization that can answer if and how the amount of electricity consumption is affected by the weather.

In [None]:
electric_df = pd.read_csv('Electricity Data 2016_2017.csv')
temp_df = pd.read_csv('Temperature 2016.csv')

In [None]:
electric_df

In [None]:
temp_df

- The data object definition of electric_df is the electricity consumption over 15 minutes, but the data object definition of temp_df is the temperature every hour. This shows that we have to face the aggregation mismatch challenge of data integration (**Challenge 4**).
- temp_df only contains the data for 2016, while electric_df contains the data for 2016 and some parts of 2017.
- Neither temp_df nor electric_df has indexes that can be used to connect the data objects across the two DataFrames. This shows that we will also have to face the challenge of index mismatched formatting (**Challenge 3**).
<br></br>

**1.1 Remove the 2017 data objects from electric_df**

In [None]:
BM = electric_df.Date.str.contains('2017')
dropping_index = electric_df[BM].index
electric_df.drop(index = dropping_index,inplace=True)

In [None]:
electric_df

**1.2 Add a new column titled <font color='blue'>Hour</font> to electric_df from the Time attribute**

In [None]:
electric_df['Hour'] = electric_df.Time.apply(lambda v: '{}:00'.format(v.split(':')[0]))
electric_df
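
As an aside, the same Hour values can be produced without a Python-level lambda by using pandas string methods. This is a sketch on a toy series, assuming the Time values look like `H:MM`, as the code above implies.

```python
import pandas as pd

# Toy Time values in the assumed 'H:MM' format.
times = pd.Series(["0:15", "0:30", "13:45"])

# The approach used above, applied element by element.
hour_apply = times.apply(lambda v: '{}:00'.format(v.split(':')[0]))

# Vectorized equivalent: take the part before ':' and append ':00'.
hour_vec = times.str.split(":").str[0] + ":00"

print(hour_vec.tolist())
```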

**1.3 Create a new data structure whose definition of the data object is hourly electricity consumption**

The following code uses the <font color='blue'>.groupby()</font> method to create *integrate_sr*. The pandas series *integrate_sr* is a stopgap data structure that will be used for integration in the later steps.

> One good question to ask here is this: why are we using the <font color='blue'>.sum()</font> aggregate function instead of <font color='blue'>.mean()</font>? The reason is the nature of the data. The electricity consumption of an hour is the summation of the electricity consumption of its 15-minute pieces.

In [None]:
integrate_sr = electric_df.groupby(['Date','Hour']).Consumption.sum()
integrate_sr
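
The difference between the two aggregate functions is easy to see on a toy hour made of four 15-minute readings (made-up values, following the 15-minute data object definition noted earlier).

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["1/1/2016"] * 4,
    "Hour": ["0:00"] * 4,
    "Consumption": [1.0, 2.0, 3.0, 4.0],  # four 15-minute readings
})

hourly_sum = toy.groupby(["Date", "Hour"]).Consumption.sum()
hourly_mean = toy.groupby(["Date", "Hour"]).Consumption.mean()

print(hourly_sum.iloc[0], hourly_mean.iloc[0])
```

The sum is the energy consumed during the hour, whereas the mean is only the average of the individual readings.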

**1.4 Add the <font color='blue'>Date</font> and <font color='blue'>Hour</font> columns to temp_df from Timestamp**

In [None]:
temp_df

In [None]:
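def unpackTimestamp(r):
  # Timestamp has the form '<year>-<month>-<day>T<hour>:...'; split it into its parts
  ts = r.Timestamp
  date,time = ts.split('T')
  hour = time.split(':')[0]
  year,month,day = date.split('-')

  # int() drops leading zeros so that the values follow the Date and Hour formats of electric_df
  r['Hour'] = '{}:00'.format(int(hour))
  r['Date'] = '{}/{}/{}'.format(int(month),int(day),year)
  return r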

In [None]:
temp_df = temp_df.apply(unpackTimestamp,axis=1)

In [None]:
temp_df = temp_df.set_index(['Date','Hour']).drop(columns=['Timestamp'])

In [None]:
temp_df
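
Applying a function row by row can be slow on large tables. A vectorized alternative, sketched here on toy data and assuming the Timestamp strings are ISO-like (for example `2016-01-01T00:00:00`), builds the same Date and Hour strings with the `.dt` accessor.

```python
import pandas as pd

# Toy stand-in for temp_df; the values are made up.
temp_toy = pd.DataFrame({
    "Timestamp": ["2016-01-01T00:00:00", "2016-03-09T13:00:00"],
    "temp": [30.2, 55.1],
})

ts = pd.to_datetime(temp_toy["Timestamp"])

# Build 'M/D/YYYY' and 'H:00' strings without leading zeros.
temp_toy["Date"] = (ts.dt.month.astype(str) + "/" + ts.dt.day.astype(str)
                    + "/" + ts.dt.year.astype(str))
temp_toy["Hour"] = ts.dt.hour.astype(str) + ":00"

print(temp_toy[["Date", "Hour", "temp"]])
```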

**1.5 Ready to use <font color='blue'>.join()</font> to integrate the two sources**

In [None]:
integrate_df = temp_df.join(integrate_sr)

In [None]:
integrate_df
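
Because `.join()` performs a left join by default, rows of the left DataFrame that find no partner are kept with missing values. A quick check for such rows can be sketched on toy data (made-up values, same index names as above).

```python
import pandas as pd

# Toy temperature table indexed by Date and Hour.
temp_toy = pd.DataFrame({
    "Date": ["1/1/2016", "1/1/2016"],
    "Hour": ["0:00", "1:00"],
    "temp": [30.0, 29.0],
}).set_index(["Date", "Hour"])

# Toy consumption series that only covers the first hour.
cons_toy = pd.Series(
    [4.0],
    index=pd.MultiIndex.from_tuples([("1/1/2016", "0:00")], names=["Date", "Hour"]),
    name="Consumption",
)

joined = temp_toy.join(cons_toy)  # left join on the shared MultiIndex
n_unmatched = joined["Consumption"].isna().sum()
print(joined)
print(n_unmatched)
```

A nonzero count after the real join would suggest that the Date or Hour strings of the two sources do not match everywhere.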

**1.6 Reset the index of integrate_df**

We no longer need the index for integration purposes, nor do we need those values for visualization purposes.

In [None]:
integrate_df.reset_index(inplace=True)

In [None]:
integrate_df

**1.7 Create a line plot of the whole year's electricity consumption**

- The following code creates the <font color='blue'>days</font> list, which contains all the unique dates from integrate_df. In essence, the code loops through the days list and, for each unique day, draws the line plot of that day's electricity consumption alongside those of the days before and after. The color of each day's line plot is determined by that day's temperature average, that is, <font color='blue'>wdf.temp.mean()</font>.

- The colors in the visualization are created based on the **RGB** color codes. RGB stands for Red, Green, and Blue. All colors can be created by using a combination of these three colors. You can specify the amount of each color you'd like and Matplotlib will produce that color for you. In Matplotlib, each of the three amounts takes a value from 0 to 1.

- A <font color='blue'>Boolean Mask (BM)</font> and <font color='blue'>plt.xticks()</font> are used to label only the 28th of each month on the x axis so that we don't have a cluttered x axis.

In [None]:
days = integrate_df.Date.unique()

max_temp, min_temp = integrate_df.temp.max(), integrate_df.temp.min()
green = 0.1

plt.figure(figsize=(20,5))

for d in days:
  BM = integrate_df.Date == d
  wdf = integrate_df[BM]

  average_temp = wdf.temp.mean()
  red = (average_temp - min_temp)/ (max_temp - min_temp)
  blue = 1-red
  clr = [red,green,blue]
  plt.plot(wdf.index,wdf.Consumption,c = clr)

BM = (integrate_df.Hour =='0:00') & (integrate_df.Date.str.contains('/28/'))
plt.xticks(integrate_df[BM].index,integrate_df[BM].Date,rotation=90)
plt.grid()
plt.margins(y=0,x=0)
plt.show()
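
The color logic inside the loop can be isolated into a small helper to see how temperatures map to colors. The function name `temp_to_rgb` is hypothetical; the arithmetic mirrors the code above.

```python
def temp_to_rgb(avg_temp, min_temp, max_temp, green=0.1):
    """Map a temperature to [red, green, blue], each between 0 and 1."""
    red = (avg_temp - min_temp) / (max_temp - min_temp)  # 0 at min, 1 at max
    blue = 1 - red
    return [red, green, blue]

coldest = temp_to_rgb(0.0, 0.0, 100.0)
middle = temp_to_rgb(50.0, 0.0, 100.0)
hottest = temp_to_rgb(100.0, 0.0, 100.0)
print(coldest, middle, hottest)
```

The coldest day is drawn in blue and the hottest day in red, with shades of purple in between.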

### Example 2 (Challenges 2 & 3)
---
The data includes the sensor performance readings of six taekwondo athletes, who have varying levels of experience and expertise. We would like to see if the athletes' gender, age, weight, and experience influence the level of impact they can create when they perform the following techniques:
- Roundhouse/Round Kick (R)
- Back Kick (B)
- Cut Kick (C)
- Punch (P)

In [None]:
athlete_df = pd.read_csv('Table1.csv')
athlete_df

In [None]:
unknown_df = pd.read_csv('Taekwondo.csv')
unknown_df

**2.1 Create an empty pandas DataFrame called <font color='blue'>performance_df</font>**

This dataset has been designed so that both *athlete_df* and
*unknown_df* can be integrated into it.

In [None]:
designed_columns = ['Participant_id','Gender','Age','Weight','Experience','Technique_id','Trial_number','Average_read']
n_rows = len(unknown_df.columns)-1
performance_df = pd.DataFrame(index=range(n_rows),columns=designed_columns)

In [None]:
performance_df

**2.2 Perform some level I data cleaning for <font color='blue'>athlete_df</font>**

In [None]:
athlete_df.set_index('Participant ID',inplace=True)
athlete_df.columns = ['Sex', 'Age', 'Weight', 'Experience', 'Belt']
athlete_df

**2.3 Create and run the loop that will fill up <font color='blue'>performance_df</font>**

Because the dataset has been collected unwisely, we cannot use simple functions such as <font color='blue'>.join()</font> for data integration here. Instead, we need to use a loop to go through the many records of unknown_df and athlete_df and fill out performance_df row by row and, at times, cell by cell.

In [None]:
techniques = ['R','B','C','P']
index = 0
for col in unknown_df.columns:
    if(col[0] in techniques):
        performance_df.loc[index,'Technique_id'] = col[0]
        performance_df.loc[index,'Trial_number'] = unknown_df[col][1]

        P_id = unknown_df[col][0]
        performance_df.loc[index,'Participant_id'] = P_id
        performance_df.loc[index,'Gender'] = athlete_df.loc[P_id].Sex
        performance_df.loc[index,'Age'] = athlete_df.loc[P_id].Age
        performance_df.loc[index,'Weight'] = athlete_df.loc[P_id].Weight
        performance_df.loc[index,'Experience'] = athlete_df.loc[P_id].Experience

        BM = unknown_df[col][2:].isna()
        performance_df.loc[index,'Average_read'] = unknown_df[col][2:][~BM].astype(int).mean()
        index +=1

In [None]:
performance_df
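
An alternative to filling a preallocated DataFrame cell by cell is to collect one dictionary per column and build the DataFrame at the end. The sketch below runs on toy data whose layout is inferred from the loop above (row 0 holds the participant id, row 1 the trial number, and the remaining rows the readings); the `ID` label column, the ids, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for unknown_df: one column per (technique, trial) recording.
wide = pd.DataFrame({
    "ID": ["Participant ID", "Trial", np.nan, np.nan, np.nan],
    "R": ["P1", 1, 100, 120, np.nan],
    "R.1": ["P1", 2, 90, 110, 130],
    "B": ["P2", 1, 80, np.nan, np.nan],
})
# Toy stand-in for athlete_df, indexed by participant id.
athletes = pd.DataFrame({"Sex": ["F", "M"], "Age": [20, 25]}, index=["P1", "P2"])

records = []
for col in wide.columns:
    if col[0] not in ("R", "B", "C", "P"):
        continue  # skip the label column
    p_id = wide[col].iloc[0]
    # Readings start at row 2; drop the padding at the bottom.
    readings = pd.to_numeric(wide[col].iloc[2:], errors="coerce").dropna()
    records.append({
        "Participant_id": p_id,
        "Gender": athletes.loc[p_id, "Sex"],
        "Technique_id": col[0],
        "Trial_number": int(wide[col].iloc[1]),
        "Average_read": readings.mean(),
    })

performance = pd.DataFrame.from_records(records)
print(performance)
```

Built this way, the numeric columns come out with numeric dtypes, and the number of rows does not need to be known in advance.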

**2.4 Bring our attention to the data analytic goals**

In [None]:
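select_attributes = ['Gender','Age','Experience','Weight']

# performance_df was created empty and filled cell by cell, so its columns have the object dtype;
# cast the readings to float so that seaborn treats them as numeric
performance_df['Average_read'] = performance_df['Average_read'].astype(float)

for i,att in enumerate(select_attributes):
  plt.subplot(2,2,i+1)
  sns.boxplot(data=performance_df, y='Average_read', x=att)

plt.tight_layout()
plt.show()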

In the preceding diagram, we can see meaningful relationships between Average_read and Gender, Age, Experience, and Weight. In a nutshell, these attributes can change the impact of the techniques that are performed by the athletes. For example, we can see that as the **experience** of an athlete increases, the impact of the techniques that are performed by the athlete increases.
<br></br>
We can also see a surprising trend: the impact of the techniques that are performed by **female** athletes is significantly higher than that of male athletes. After seeing this surprising trend, let's look back at athlete_df. We realize that there is <font color='red'>only one female</font> athlete in the data, so we cannot count on this visualized trend.
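
A quick way to catch this kind of issue before reading too much into a box plot is to count how many distinct athletes stand behind each group. The sketch below uses made-up data with the same column names as performance_df.

```python
import pandas as pd

# Toy stand-in for performance_df; the ids and readings are made up.
perf_toy = pd.DataFrame({
    "Participant_id": ["P1", "P1", "P2", "P2", "P3", "P3"],
    "Gender": ["F", "F", "M", "M", "M", "M"],
    "Average_read": [110.0, 105.0, 80.0, 85.0, 90.0, 95.0],
})

# Several rows per athlete can hide the fact that a group has only one member.
rows_per_group = perf_toy["Gender"].value_counts()
athletes_per_group = perf_toy.groupby("Gender")["Participant_id"].nunique()
print(rows_per_group)
print(athletes_per_group)
```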