In [1]:
import pandas as pd
import sys
sys.path.append('../src')
from scrapy import Scrapy
from convert_json_csv import ConvertJSONToCsv

In [None]:
app_ids = ['com.combanketh.mobilebanking', 'com.boa.boaMobileBanking', 'com.dashen.dashensuperapp']

**Collecting App Review Data**

In this step, we initialize the data conversion and scraping classes, then iterate through the list of app IDs to collect review data for each banking app. For every app ID in `app_ids`, we use the `Scrapy` class to fetch and save the review data as a JSON file in the `../data/` directory. This process ensures that raw review data for each app is stored locally for further processing and analysis.

In [6]:

csv = ConvertJSONToCsv()
scrapy = Scrapy()
for app_id in app_ids:
    data = scrapy.save_data_json('../data/',f'data-{app_id}.json', app_id)
   


Data already imported
Data already imported


**Saving the Data to a CSV File**

After collecting and processing the app review data, we save the cleaned DataFrame to a CSV file for further analysis or sharing. This ensures the data is stored in a structured, accessible format.

- Each app's reviews are saved as a separate CSV file named after the bank.
- The CSV files are stored in the `../data/` directory.
- This step is essential for downstream tasks such as data exploration, visualization, or machine learning.

In [7]:
import json
for id in app_ids:
    with open(f'../data/data-{id}.json') as f:
        data = json.load(f)
        print(f'{id} total reviews {len(data)}')
        csv.to_csv(data, f'{id.split('.')[1]}.csv')

com.combanketh.mobilebanking total reviews 7500
com.boa.boaMobileBanking total reviews 1044
com.dashen.dashensuperapp total reviews 449


**Preprocessing and Cleaning App Review Data**

In this step, we preprocess the raw review data for each banking app to ensure consistency and quality for further analysis:

- **Import Required Modules:** We use `pandas` for data manipulation and import custom preprocessing functions `clean_data` and `drop_column`.
- **Iterate Through Each App:** For every app in `app_ids`, we:
    - Load the corresponding CSV file into a DataFrame.
    - Print basic information about the data for inspection.
    - Clean the data using the `clean_data` function.
    - Add `bank` and `source` columns to label the data.
    - Remove unnecessary columns (`userImage`, `userName`, `thumbsUpCount`, `reviewId`) if they exist.
    - Save the cleaned DataFrame back to CSV for downstream tasks.

This ensures that all review datasets are standardized, labeled, and free from irrelevant columns, making them ready for sentiment analysis and further exploration.

In [10]:
import pandas as pd
from preprocess import clean_data, drop_column
filtered_data = None
for id in app_ids:
    bank = id.split('.')[1]
    data = pd.read_csv(f'../data/{bank}.csv')
    print(id.split('.')[1])
    print('='*30)
    print(data.info())
    cleaned_data = clean_data(data)
    cleaned_data['bank'] = bank
    cleaned_data['source'] = 'play-store'
    if {'userImage', 'userName', 'thumbsUpCount', 'reviewId'}.issubset(cleaned_data.columns):
        drop_column(cleaned_data,['userImage', 'userName', 'thumbsUpCount', 'reviewId'])
    print(data.info())
    cleaned_data.to_csv(f'../data/{bank}.csv')



combanketh
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  7500 non-null   int64 
 1   content     7493 non-null   object
 2   score       7500 non-null   int64 
 3   at          7500 non-null   object
 4   bank        7500 non-null   object
 5   source      7500 non-null   object
dtypes: int64(2), object(4)
memory usage: 351.7+ KB
None
columns above the threshold
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Unnamed: 0  7500 non-null   int64         
 1   content     7493 non-null   object        
 2   score       7500 non-null   int64         
 3   at          7500 non-null   datetime64[ns]
 4   bank        7500 non-null   object        
 5   source      7500 non-null   obj

**Sentiment Analysis on App Reviews**

This step performs sentiment analysis on the cleaned app review data:

- **Import the Sentiment Analysis Function:** Use `sentiment_analysis` to classify review sentiments.
- **Load Cleaned Data:** For each app, load its cleaned CSV file into a DataFrame.
- **Handle Missing Content:** Remove rows where the review content is missing to avoid errors during analysis.
- **Apply Sentiment Analysis:** If the DataFrame does not already contain a `sentiment` column, apply the sentiment analysis function to each review and add the results as a new column.
- **Preview Results:** Display the first few rows of the DataFrame to verify the sentiment analysis output.

In [4]:
from sentiment import sentiment_analysis

for id in app_ids:
    bank = id.split('.')[1]
    data = pd.read_csv(f'../data/{bank}.csv')
    # Drop rows where 'content' is NaN to avoid errors in sentiment_analysis
    data = data.dropna(subset=['content'])
    if not {'sentiment'}.issubset(data.columns):
        data['sentiment'] = data['content'].apply(sentiment_analysis)
    print(data.head())


   Unnamed: 0.1  Unnamed: 0  \
0             0           0   
1             1           1   
2             2           2   
3             3           3   
4             4           4   

                                             content  score          at  \
0                                               good      5  2025-06-04   
1  it was good app but it have some issues like i...      2  2025-06-04   
2                                              dedeb      5  2025-06-04   
3                                               good      5  2025-06-04   
4                                               Good      5  2025-06-04   

         bank      source  sentiment  
0  combanketh  play-store     0.4404  
1  combanketh  play-store     0.6369  
2  combanketh  play-store     0.0000  
3  combanketh  play-store     0.4404  
4  combanketh  play-store     0.4404  
   Unnamed: 0.1  Unnamed: 0  \
0             0           0   
1             1           1   
2             2           2   
3   