# Obsei Tutorial 03
## This example shows the following Obsei workflow
 1. Observe: App Store's app reviews
 2. Pre-process: Clean the review text
 3. Analyze: Classify review text against a given category list
 4. Inform: Provide all data in Pandas DataFrame
 5. Store: Store data in Google Drive in CSV format

## Install Obsei from the latest code by performing these steps
- Select the GPU runtime type for faster computation
- Restart the runtime after installation


In [None]:
!pip install git+https://github.com/lalitpagaria/obsei.git

Collecting git+https://github.com/lalitpagaria/obsei.git
  Cloning https://github.com/lalitpagaria/obsei.git to /tmp/pip-req-build-wl_1hpon
  Running command git clone -q https://github.com/lalitpagaria/obsei.git /tmp/pip-req-build-wl_1hpon
Building wheels for collected packages: obsei
  Building wheel for obsei (setup.py) ... done
  Created wheel for obsei: filename=obsei-0.0.9-cp37-none-any.whl size=65557 sha256=cce33049986ee20144625a85f90699a6ae020c7a8454bb4f156750446385e03b
  Stored in directory: /tmp/pip-ephem-wheel-cache-4be2m6lr/wheels/49/1a/6e/2fd83c9a275b7096fc615a0edef2d55b1fc33c3751ba45c1ad
Successfully built obsei


## Mount your Google Drive to store CSV

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Configure the following inputs
- `name`: Brand name of the app
- `category_list`: List of categories used to classify the review text
- `identifier`: ID of the app; it can be found at the end of the app's URL in the App Store
- `country`: Country of the reviews
- `lookup_period`: How far back to collect reviews, e.g. `365d` for the past 365 days (**Note**: Apple rate-limits requests and provides at most 450 reviews)
- `extra_stop_words`: Extra stop words to remove from the review text
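
To make the `lookup_period` format concrete, here is a minimal sketch of how a string such as `365d` could be turned into a cutoff timestamp. The helpers `parse_period` and `cutoff_for` are hypothetical and are not part of Obsei; the supported unit letters are also an assumption made for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical unit table (an assumption, not Obsei's own parser).
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}


def parse_period(period: str) -> timedelta:
    """Convert a string like '365d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([dhm])", period)
    if match is None:
        raise ValueError(f"Unsupported period: {period!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def cutoff_for(period: str, now: datetime) -> datetime:
    """Return the oldest timestamp that still falls inside the period."""
    return now - parse_period(period)


# Fixed reference time so the result is reproducible.
now = datetime(2021, 7, 10, 12, 0, tzinfo=timezone.utc)
cutoff = cutoff_for("365d", now)
print(cutoff.isoformat())
```

Reviews older than `cutoff` would fall outside the lookup window in this sketch.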

In [None]:
name = "zomato"
category_list = ["easy order placement", "realtime order tracking", "easy payment options", "rewards and discounts","user interface", "social media Integration"]
identifier = "434613896"
country = "in"
lookup_period = "365d"
extra_stop_words = ["i", "-", "day", "will", ".", "use", "n", "without", "please", "app", "ha", "ho", "nt", "wa", 
                    "thi", "plz", "pleas", "ff", "ya", "thank", "you", "thanks", "mai"]

## Configure columns of Pandas DataFrame
Only the columns listed in `included_cols` are returned by the Pandas Sink; `rename_cols_dict` then renames the selected columns to more readable names.

In [None]:
included_cols = [f"segmented_data_classifier_data_{category}" for category in category_list]
included_cols.append("segmented_data_classifier_data_positive")
included_cols.append("segmented_data_classifier_data_negative")
included_cols.append("processed_text")
included_cols.append("meta_at")
included_cols.append("meta_date")
included_cols.append("meta_published date")
included_cols.append("meta_rating")
# included_cols.append("meta_title")
included_cols.append("meta_publisher_title")

rename_cols_dict = {f"segmented_data_classifier_data_{category}": category for category in category_list}
rename_cols_dict["segmented_data_classifier_data_positive"] = "positive"
rename_cols_dict["segmented_data_classifier_data_negative"] = "negative"
rename_cols_dict["processed_text"] = "text"
rename_cols_dict["meta_at"] = "time"
rename_cols_dict["meta_date"] = "time"
rename_cols_dict["meta_rating"] = "ratings"
rename_cols_dict["meta_published date"] = "time"
# rename_cols_dict["meta_title"] = "title"
rename_cols_dict["meta_publisher_title"] = "news publisher"
rename_cols_dict['Unnamed: 0'] = 'reviews'
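
The effect of selecting and then renaming columns can be tried on a toy DataFrame with plain pandas. The values and the extra `meta_author` column below are made up for illustration, and filtering the list down to columns that actually exist is an assumption of this sketch, not a statement about how the Pandas Sink behaves.

```python
import pandas as pd

prefix = "segmented_data_classifier_data_"
categories = ["user interface", "easy payment options"]

# Toy frame with made-up values standing in for analyzer output.
raw = pd.DataFrame({
    f"{prefix}user interface": [0.81],
    f"{prefix}easy payment options": [0.07],
    "processed_text": ["love the burger"],
    "meta_date": ["2021-07-10 12:19:20"],
    "meta_rating": [5],
    "meta_author": ["someone"],  # not selected, so it is dropped
})

keep = [f"{prefix}{c}" for c in categories]
keep += ["processed_text", "meta_at", "meta_date", "meta_rating"]

rename = {f"{prefix}{c}": c for c in categories}
rename.update({
    "processed_text": "text",
    "meta_at": "time",      # absent here; rename() silently ignores it
    "meta_date": "time",
    "meta_rating": "ratings",
})

# Keep only the requested columns that are present, then rename them.
present = [col for col in keep if col in raw.columns]
result = raw[present].rename(columns=rename)
print(list(result.columns))
```

Several source columns map to `time` on purpose: only one of them is expected to exist for a given source, so the result still has a single `time` column.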

## Configure App Store Review Observer

In [None]:
from obsei.source.appstore_scrapper import (
    AppStoreScrapperConfig,
    AppStoreScrapperSource,
)

source_config = AppStoreScrapperConfig(
    countries=[country],
    app_id=identifier,
    lookup_period=lookup_period
)

source = AppStoreScrapperSource()

## Configure TextCleaner as Pre-Processor to clean review text
These cleaning functions run serially, in the order listed

In [None]:
from obsei.preprocessor.text_cleaner import TextCleaner, TextCleanerConfig
from obsei.preprocessor.text_cleaning_function import *

text_cleaner_config = TextCleanerConfig(
    stop_words=extra_stop_words,
    cleaning_functions = [
        ToLowerCase(),
        RemoveWhiteSpaceAndEmptyToken(),
        RemovePunctuation(),
        RemoveSpecialChars(),
        DecodeUnicode(),
        RemoveDateTime(),
        RemoveStopWords(),
        RemoveStopWords(stop_words=extra_stop_words),
        RemoveWhiteSpaceAndEmptyToken(),
   ]
)

text_cleaner = TextCleaner()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
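
Because the cleaning functions run one after another, their order affects the result. The following is a minimal plain-Python sketch of a serial cleaning pipeline; the step functions are simplified stand-ins written for this example and are not Obsei's implementations.

```python
import string
from typing import Callable, List

Tokens = List[str]
CleaningStep = Callable[[Tokens], Tokens]


def to_lower(tokens: Tokens) -> Tokens:
    return [t.lower() for t in tokens]


def remove_punctuation(tokens: Tokens) -> Tokens:
    table = str.maketrans("", "", string.punctuation)
    return [t.translate(table) for t in tokens]


def remove_empty(tokens: Tokens) -> Tokens:
    return [t.strip() for t in tokens if t.strip()]


def remove_stop_words(stop_words: List[str]) -> CleaningStep:
    blocked = set(stop_words)

    def step(tokens: Tokens) -> Tokens:
        return [t for t in tokens if t not in blocked]

    return step


def clean(text: str, steps: List[CleaningStep]) -> str:
    """Apply each step in order, feeding its output to the next one."""
    tokens = text.split()
    for step in steps:
        tokens = step(tokens)
    return " ".join(tokens)


stop = remove_stop_words(["please", "app", "i"])
review = "Please fix the App, I LOVE it!"

# Punctuation is stripped before stop words are removed.
print(clean(review, [to_lower, remove_punctuation, remove_empty, stop]))
# Stop words are removed first, so "app," slips through.
print(clean(review, [to_lower, stop, remove_punctuation, remove_empty]))
```

In the second ordering the token `app,` still carries its comma when the stop-word step runs, so it is not matched and survives as `app`.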


## Configure Classification Analyzer
**Note**: To try a different model, select one from https://huggingface.co/models?pipeline_tag=zero-shot-classification

In [None]:
from obsei.analyzer.classification_analyzer import ClassificationAnalyzerConfig, ZeroShotClassificationAnalyzer

analyzer_config=ClassificationAnalyzerConfig(
   labels=category_list,
)

text_analyzer = ZeroShotClassificationAnalyzer(
   model_name_or_path="typeform/mobilebert-uncased-mnli",
   device="auto"
)
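
Zero-shot classifiers built on NLI models commonly score each label on its own: the review is paired with the label, and the entailment logit is compared with the contradiction logit. The sketch below uses made-up logits to show why such scores need not sum to 1 across labels. Whether Obsei's analyzer uses exactly this scheme is an assumption here.

```python
import math
from typing import Dict, Tuple


def entailment_probability(contradiction_logit: float, entailment_logit: float) -> float:
    """Softmax over the two logits, returning the entailment share."""
    peak = max(contradiction_logit, entailment_logit)  # for numerical stability
    exp_contradiction = math.exp(contradiction_logit - peak)
    exp_entailment = math.exp(entailment_logit - peak)
    return exp_entailment / (exp_contradiction + exp_entailment)


# Made-up (contradiction, entailment) logits for one review, per label.
logits: Dict[str, Tuple[float, float]] = {
    "user interface": (-1.0, 1.5),
    "easy payment options": (2.0, -0.5),
    "rewards and discounts": (0.0, 0.0),
}

scores = {
    label: entailment_probability(contradiction, entailment)
    for label, (contradiction, entailment) in logits.items()
}

for label, score in scores.items():
    print(f"{label}: {score:.2f}")
print(f"sum of scores: {sum(scores.values()):.2f}")
```

Each score is a separate probability between 0 and 1, so a short, generic review can score high on several categories at once.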

## Configure Pandas DataFrame Informer

In [None]:
from pandas import DataFrame
from obsei.sink.pandas_sink import PandasSink, PandasSinkConfig

sink_config = PandasSinkConfig(
   dataframe=DataFrame(),
   include_columns_list=included_cols
)
sink = PandasSink()

## Fetch app reviews

In [None]:
source_response_list = source.lookup(source_config)

## Pre-process review text to clean it

In [None]:
cleaner_response_list = text_cleaner.preprocess_input(
    input_list=source_response_list,
    config=text_cleaner_config
)

## Analyze reviews to perform classification
**Note**: This is a compute-heavy step

In [None]:
analyzer_response_list = text_analyzer.analyze_input(
    source_response_list=cleaner_response_list,
    analyzer_config=analyzer_config
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


## Inform: return review data as a Pandas DataFrame

In [None]:
sink_config = PandasSinkConfig(
   dataframe=DataFrame(),
   include_columns_list=included_cols
)

dataframe = sink.send_data(analyzer_response_list, sink_config)
dataframe.rename(rename_cols_dict,axis=1,inplace=True)


dataframe["brand"] = name
dataframe

| Unnamed: 0 | text | positive | user interface | rewards and discounts | negative | realtime order tracking | social media Integration | easy order placement | easy payment options | ratings | time | brand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | awesome unmade zomato user switched limited re... | 0.72 | 0.11 | 0.06 | 0.02 | 0.02 | 0.02 | 0.01 | 0.01 | 5 | 2021-07-10 12:21:41 | zomato |
| 1 | best service mast service thii time | 0.99 | 0.26 | 0.17 | 0.00 | 0.16 | 0.01 | 0.21 | 0.29 | 5 | 2021-07-10 12:20:34 | zomato |
| 2 | nice nice | 1.00 | 0.70 | 0.38 | 0.00 | 0.30 | 0.06 | 0.44 | 0.58 | 5 | 2021-07-10 12:20:07 | zomato |
| 3 | listening single cheese burger concern love zo... | 0.98 | 0.81 | 0.00 | 0.00 | 0.05 | 0.00 | 0.06 | 0.07 | 5 | 2021-07-10 12:19:20 | zomato |
| 4 | good good | 1.00 | 0.62 | 0.42 | 0.00 | 0.50 | 0.05 | 0.53 | 0.69 | 5 | 2021-07-10 12:15:17 | zomato |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | nice gud | 1.00 | 0.87 | 0.30 | 0.00 | 0.14 | 0.08 | 0.36 | 0.68 | 5 | 2021-07-07 15:54:35 | zomato |
| 496 | bad experience delivery guy refused take rs no... | 0.00 | 0.29 | 0.08 | 1.00 | 0.02 | 0.03 | 0.00 | 0.00 | 1 | 2021-07-07 15:54:24 | zomato |
| 497 | shikha excellent | 1.00 | 0.94 | 0.45 | 0.00 | 0.48 | 0.06 | 0.70 | 0.91 | 5 | 2021-07-07 15:53:40 | zomato |
| 498 | ordered delivery yet pathetic service | 0.00 | 0.27 | 0.01 | 1.00 | 0.00 | 0.02 | 0.00 | 0.00 | 1 | 2021-07-07 15:47:03 | zomato |


## Store result in Google Drive as CSV

In [None]:
dataframe.to_csv(f'/content/drive/MyDrive/appstore_{name}.csv')
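
When a CSV written this way is read back with pandas, the saved index shows up as a column named `Unnamed: 0`, which is the name that `rename_cols_dict` maps to `reviews`. The sketch below reproduces this with an in-memory buffer and made-up rows instead of Google Drive, and shows two ways to avoid the extra column.

```python
import io

import pandas as pd

# Toy frame with made-up rows standing in for the review DataFrame.
frame = pd.DataFrame({"text": ["nice nice", "good good"], "ratings": [5, 5]})

# Default to_csv() writes the index as a first column with an empty header.
buffer = io.StringIO()
frame.to_csv(buffer)

buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))

# Option 1: tell read_csv that the first column is the index.
buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)
print(list(with_index.columns))

# Option 2: do not write the index at all.
no_index_buffer = io.StringIO()
frame.to_csv(no_index_buffer, index=False)
no_index_buffer.seek(0)
without_index = pd.read_csv(no_index_buffer)
print(list(without_index.columns))
```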