# Data Preparation

### In this notebook we will prepare our data for our search function to use. <br>Currently we have data stored in two different ```csv``` files.<br>
* apps.csv
* user_reviews.csv
<br>
It can be computationally expensive to produce ```analysis results``` from multiple data sources for an incoming stream of requests.<br> So we will prepare our data once and save it in an ```easily searchable``` structure.

In [27]:
# Import the needed modules...
import pandas as pd
from collections import defaultdict
from os import getcwd

## Define Paths to data files.

In [28]:
PATH_APPS         = "apps.csv"
PATH_USER_REVIEWS = "user_reviews.csv"

# Data Engineering<br>
* ## Get data in dataframes.
* ## Convert data to a single dictionary.

In [29]:
"""
    Read data from apps.csv
"""
df_apps            = pd.read_csv(PATH_APPS)
apps_table_columns = df_apps.columns.tolist()
print(f"COLUMNS : {apps_table_columns}")

COLUMNS : ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [30]:
"""
    Read data from user_reviews.csv
"""
df_user_reviews            = pd.read_csv(PATH_USER_REVIEWS)
user_reviews_table_columns = df_user_reviews.columns.tolist()
print(f"COLUMNS : {user_reviews_table_columns}")

COLUMNS : ['App', 'Translated_Review', 'Sentiment', 'Sentiment_Polarity', 'Sentiment_Subjectivity']


* ```App``` is a column common to both tables, so we will use it as the primary search key. <br>
* A user will always search for an ```app``` by its ```name```, so we will create a ```Global secondary index``` to look up entries in our datastore. It takes some extra space, but that is almost negligible compared to the size of the original data. In addition, it makes searching faster and more efficient, so it is a good trade-off.
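
As a minimal sketch of the idea, a secondary index maps whatever the user types to the primary key of the main store. Everything here is an assumption for illustration and not part of this notebook: the toy data, the names `primary_store`, `secondary_index` and `find_app`, and the lowercase normalization.

```python
# Toy primary store keyed by the exact app name (hypothetical values).
primary_store = {
    "Photo Editor": {"Category": ["ART_AND_DESIGN"]},
    "Sketch Pad": {"Category": ["ART_AND_DESIGN"]},
}

# Secondary index: normalized search term -> primary key.
# Lowercasing is an assumed normalization, chosen only for this example.
secondary_index = {name.lower(): name for name in primary_store}


def find_app(query):
    """Return the stored record for `query`, or None if it is unknown."""
    key = secondary_index.get(query.strip().lower())
    return primary_store.get(key) if key is not None else None


print(find_app("photo editor"))
print(find_app("Unknown App"))
```

Both steps are plain dictionary lookups, so a search does not have to scan the rows of the original tables.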

In [31]:
print(f"It is {df_apps['App'].is_unique} that the column 'App' has unique values for all entries in the apps dataframe.")
print(f"It is {df_user_reviews['App'].is_unique} that the column 'App' has unique values for all entries in the user_reviews dataframe.")
# Sort the apps dataframe by App; App is unique for all of its entries...
df_apps_sorted = df_apps.sort_values(by=['App'])

# Sort the user_reviews dataframe by App; App is NOT unique here (an app can have many reviews)...
df_user_reviews_sorted  = df_user_reviews.sort_values(by=['App'])

It is True that the column 'App' has unique values for all entries in the apps dataframe.
It is False that the column 'App' has unique values for all entries in the user_reviews dataframe.
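
Because `App` repeats in `user_reviews`, any per-app view of the reviews needs a grouping step. The sketch below uses a small made-up dataframe to show one way to count and collect reviews per app with pandas. The rows are hypothetical; only the column names come from the output above.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror user_reviews.csv.
toy_reviews = pd.DataFrame({
    "App": ["Sketch Pad", "Photo Editor", "Sketch Pad"],
    "Sentiment": ["Positive", "Negative", "Neutral"],
})

# is_unique is False as soon as any app name appears more than once.
print(toy_reviews["App"].is_unique)

# Number of review rows per app.
review_counts = toy_reviews.groupby("App").size().to_dict()
print(review_counts)

# All sentiments per app, collected into lists.
sentiments_by_app = toy_reviews.groupby("App")["Sentiment"].apply(list).to_dict()
print(sentiments_by_app)
```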


In [32]:
# from apps dataframe...
apps         = df_apps_sorted["App"].tolist()
# Read every column from the *sorted* dataframe so the lists stay aligned with `apps`.
Categories   = [category.split("|") for category in df_apps_sorted["Category"].tolist()]
Installs     = [installs.split("|") for installs in df_apps_sorted["Installs"].tolist()]
Prices       = [price.split("|") for price in df_apps_sorted["Price"].tolist()]
Genres       = [genres.split("|") for genres in df_apps_sorted["Genres"].tolist()]
Last_updated = [last_updated.split("|") for last_updated in df_apps_sorted["Last Updated"].tolist()]

# from user_reviews dataframe...
Translated_Review  = df_user_reviews_sorted["Translated_Review"].tolist()
Sentiment  = df_user_reviews_sorted["Sentiment"].tolist()
Sentiment_Polarity  = df_user_reviews_sorted["Sentiment_Polarity"].tolist()
Sentiment_Subjectivity  = df_user_reviews_sorted["Sentiment_Subjectivity"].tolist()

In [35]:
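appDict               = {}
global_secondaryIndex = {}

# 'App' is not unique in user_reviews, so collect all review rows per app
# instead of indexing the review lists by the position of the app.
reviews_by_app = defaultdict(list)
for review in df_user_reviews_sorted.to_dict("records"):
    reviews_by_app[review["App"]].append({
        "Translated_Review" : review["Translated_Review"],
        "Sentiment" : review["Sentiment"],
        "Sentiment_Polarity" : review["Sentiment_Polarity"],
        "Sentiment_Subjectivity" : review["Sentiment_Subjectivity"]
    })

# Use `app` as the loop variable so the `apps` list is not overwritten.
for idx, app in enumerate(apps):
    appDict[app] = {
        "Category" : Categories[idx],
        "Installs" : Installs[idx],
        "Prices" : Prices[idx],
        "Genres" : Genres[idx],
        "Last_updated" : Last_updated[idx],
        # "links" holds a list with every review of this app (empty if none).
        "links" : reviews_by_app.get(app, [])
    }

    # The searchable name maps to the primary key; here both are the app name.
    global_secondaryIndex[app] = app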

In [36]:
import json
print("[INFO] Writing app Data into the disk...")
with open('dataFinal.json', 'w') as fp:
    json.dump(appDict, fp, sort_keys=True, indent=4)
print("[INFO] Writing Global Secondary Index Data into the disk...")
with open('dataFinal_GIS.json', 'w') as fp:
    json.dump(global_secondaryIndex, fp, sort_keys=True, indent=4)

[INFO] Writing app Data into the disk...
[INFO] Writing Global Secondary Index Data into the disk...


#### At this point, our database is ready, and it can handle a high inflow of requests.
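
A search function could then consume the two files roughly as follows. This is only a sketch: `load_store` and `search_app` are hypothetical helper names, and the lookup assumes that the index maps an app name to the key used in `dataFinal.json`.

```python
import json


def load_store(data_path="dataFinal.json", index_path="dataFinal_GIS.json"):
    """Load the app data and the secondary index written by this notebook."""
    with open(data_path) as fp:
        store = json.load(fp)
    with open(index_path) as fp:
        index = json.load(fp)
    return store, index


def search_app(name, store, index):
    """Resolve `name` through the index, then fetch the record from the store."""
    key = index.get(name)
    if key is None:
        return None
    return store.get(key)


# In-memory stand-ins with made-up values, so the sketch runs without the files.
toy_store = {"Sketch Pad": {"Category": ["ART_AND_DESIGN"]}}
toy_index = {"Sketch Pad": "Sketch Pad"}
print(search_app("Sketch Pad", toy_store, toy_index))
print(search_app("Missing App", toy_store, toy_index))
```

In a real service, `load_store()` would be called once at startup, and every request would then go through `search_app` against the in-memory dictionaries.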