# Obsei Tutorial 02
## This example shows the following Obsei workflow
 1. Observe: Fetch the app's reviews from the Play Store
 2. Pre-process: Clean the review text
 3. Analyze: Classify the review text against a given category list
 4. Inform: Return all data as a Pandas DataFrame
 5. Store: Save the data to Google Drive in CSV format

## Install Obsei and perform these steps
- Select a GPU runtime type for faster computation
- Restart the runtime after installation

In [None]:
!pip install obsei[all]

Collecting git+https://github.com/lalitpagaria/obsei.git
  Cloning https://github.com/lalitpagaria/obsei.git to /tmp/pip-req-build-9q4fz4j2
  Running command git clone -q https://github.com/lalitpagaria/obsei.git /tmp/pip-req-build-9q4fz4j2
Building wheels for collected packages: obsei
  Building wheel for obsei (setup.py) ... done
  Created wheel for obsei: filename=obsei-0.0.9-cp37-none-any.whl size=65557 sha256=bc7c8c937eed4a7b325b3ef8e46de64e44778e40914d99267356cc4ce36c7c27
  Stored in directory: /tmp/pip-ephem-wheel-cache-qhkx9sy8/wheels/49/1a/6e/2fd83c9a275b7096fc615a0edef2d55b1fc33c3751ba45c1ad
Successfully built obsei


## Mount your Google Drive to store the CSV

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Configure the following inputs
- `name`: Brand name of the app
- `category_list`: List of categories to classify the review text into
- `identifier`: Package name of the app; it appears at the end of the app's Play Store URL
- `country`: Country whose reviews to fetch
- `lookup_period`: How far back to collect reviews (**Note**: Google rate-limits requests and returns at most 200 reviews)
- `extra_stop_words`: Extra stop words to remove from the review text
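
The package name is the value of the `id` query parameter in the app's Play Store URL. As a minimal sketch, it can be pulled out with the standard library; the helper name `package_name_from_url` and the example URL are illustrative, not part of Obsei.

```python
from urllib.parse import parse_qs, urlparse


def package_name_from_url(url: str) -> str:
    """Return the `id` query parameter of a Play Store app URL."""
    query = parse_qs(urlparse(url).query)
    ids = query.get("id")
    if not ids:
        raise ValueError(f"No 'id' parameter found in URL: {url}")
    return ids[0]


# Illustrative URL built from the identifier used in this tutorial.
example_url = (
    "https://play.google.com/store/apps/details"
    "?id=com.application.zomato&hl=en"
)
example_identifier = package_name_from_url(example_url)
print(example_identifier)
```

The returned string is what goes into `identifier`.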



In [None]:
name = "zomato"
category_list = ["easyOrder placement", "Realtime order tracking", "easy payment options","Rewards and discounts","user interface","social media Integration",]
identifier = "com.application.zomato"
country = "in"
lookup_period = "365d"
extra_stop_words = ["i", "-", "day", "will", ".", "use", "n", "without", "please", "app", "ha", "ho", "nt", "wa", 
                    "thi", "plz", "pleas", "ff", "ya", "thank", "you", "thanks", "mai"]

## Configure columns of the Pandas DataFrame
Pandas Sink returns only the columns listed in `included_cols`; `rename_cols_dict` then renames those columns to the desired names.

In [None]:
included_cols = [f"segmented_data_classifier_data_{category}" for category in category_list]
included_cols.append("segmented_data_classifier_data_positive")
included_cols.append("segmented_data_classifier_data_negative")
included_cols.append("processed_text")
included_cols.append("meta_at")
included_cols.append("meta_date")
included_cols.append("meta_published date")
included_cols.append("meta_score")
# included_cols.append("meta_title")
included_cols.append("meta_publisher_title")

rename_cols_dict = {f"segmented_data_classifier_data_{category}": category for category in category_list}
rename_cols_dict["segmented_data_classifier_data_positive"] = "positive"
rename_cols_dict["segmented_data_classifier_data_negative"] = "negative"
rename_cols_dict["processed_text"] = "text"
rename_cols_dict["meta_at"] = "time"
rename_cols_dict["meta_date"] = "time"
rename_cols_dict["meta_published date"] = "time"
rename_cols_dict["meta_score"] = "ratings"
# rename_cols_dict["meta_title"] = "title"
rename_cols_dict["meta_publisher_title"] = "news publisher"
rename_cols_dict['Unnamed: 0'] = 'reviews'
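
Column names such as `segmented_data_classifier_data_positive` and `meta_score` look like nested keys joined with underscores. Assuming that is how they are derived (an assumption, not something this tutorial confirms), the sketch below flattens a hypothetical nested record, keeps only the included columns, and renames them. The `flatten` helper and the sample record are illustrative stand-ins, not Obsei code.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining key paths with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical record, shaped only to match the column names used above.
sample_record = {
    "processed_text": "nice service",
    "meta": {"score": 5},
    "segmented_data": {"classifier_data": {"positive": 0.99, "negative": 0.0}},
}

sample_included = [
    "segmented_data_classifier_data_positive",
    "processed_text",
    "meta_score",
]
sample_renames = {
    "segmented_data_classifier_data_positive": "positive",
    "processed_text": "text",
    "meta_score": "ratings",
}

flat_record = flatten(sample_record)
# Keep only included columns that exist, then apply the rename mapping.
row = {
    sample_renames.get(col, col): flat_record[col]
    for col in sample_included
    if col in flat_record
}
print(row)
```

Columns listed in the include list but absent from a record are simply skipped here, which is why listing several candidate time columns does no harm in this sketch.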

## Configure Play Store Review Observer




In [None]:
from obsei.source.playstore_scrapper import (
    PlayStoreScrapperSource,
    PlayStoreScrapperConfig,
)

source_config = PlayStoreScrapperConfig(
    countries=[country],
    package_name=identifier,
    lookup_period=lookup_period
)

source = PlayStoreScrapperSource()
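
A `lookup_period` such as `"365d"` describes how far back from now to look. The sketch below shows one way such a string could be turned into a cutoff timestamp, assuming a number followed by a unit letter (`d`, `h`, or `m`). Both helpers are hypothetical; Obsei's own parser may accept other forms.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical unit table for this sketch only.
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}


def parse_lookup_period(period: str) -> timedelta:
    """Convert strings like '365d' or '12h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([dhm])", period.strip())
    if match is None:
        raise ValueError(f"Unsupported lookup period: {period!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def cutoff_time(period: str, now: datetime) -> datetime:
    """Return the oldest timestamp that still falls inside the period."""
    return now - parse_lookup_period(period)


reference_now = datetime(2021, 7, 11, 17, 0, tzinfo=timezone.utc)
print(cutoff_time("365d", reference_now))
```

Reviews older than the cutoff would be ignored, and the 200-review cap noted earlier still applies on top of this window.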

## Configure TextCleaner as Pre-Processor to clean review text
These cleaning functions run serially, in the order listed.

In [None]:
from obsei.preprocessor.text_cleaner import TextCleaner, TextCleanerConfig
from obsei.preprocessor.text_cleaning_function import *

text_cleaner_config = TextCleanerConfig(
    stop_words=extra_stop_words,
    cleaning_functions=[
        ToLowerCase(),
        RemoveWhiteSpaceAndEmptyToken(),
        RemovePunctuation(),
        RemoveSpecialChars(),
        DecodeUnicode(),
        RemoveDateTime(),
        RemoveStopWords(),
        RemoveStopWords(stop_words=extra_stop_words),
        RemoveWhiteSpaceAndEmptyToken(),
    ],
)

text_cleaner = TextCleaner()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
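
Because the cleaning functions run one after another, their order affects the result. The sketch below illustrates the idea with plain Python stand-ins; these helpers are simplified illustrations, not Obsei's implementations.

```python
import string
from typing import Callable, Iterable, List

Tokens = List[str]
Step = Callable[[Tokens], Tokens]


def to_lower(tokens: Tokens) -> Tokens:
    return [t.lower() for t in tokens]


def remove_punctuation(tokens: Tokens) -> Tokens:
    table = str.maketrans("", "", string.punctuation)
    return [t.translate(table) for t in tokens]


def remove_empty(tokens: Tokens) -> Tokens:
    return [t.strip() for t in tokens if t.strip()]


def make_stop_word_remover(stop_words: Iterable[str]) -> Step:
    blocked = set(stop_words)

    def remove(tokens: Tokens) -> Tokens:
        return [t for t in tokens if t not in blocked]

    return remove


def run_serially(text: str, steps: List[Step]) -> str:
    """Apply each step to the output of the previous one."""
    tokens = text.split()
    for step in steps:
        tokens = step(tokens)
    return " ".join(tokens)


remover = make_stop_word_remover(["please", "app", "thanks"])
sample_text = "Please FIX the App , thanks !!"

# Lowercasing first lets the stop-word step match "Please" and "App".
cleaned = run_serially(sample_text, [to_lower, remove_punctuation, remove_empty, remover])
# Running the stop-word step first misses the capitalised tokens.
reordered = run_serially(sample_text, [remover, to_lower, remove_punctuation, remove_empty])
print(cleaned)
print(reordered)
```

In this sketch, the final empty-token pass matters too: stripping punctuation can leave empty strings behind that a later step should drop.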


## Configure Classification Analyzer
**Note**: To try a different model, pick one from https://huggingface.co/models?pipeline_tag=zero-shot-classification

In [None]:
from obsei.analyzer.classification_analyzer import ClassificationAnalyzerConfig, ZeroShotClassificationAnalyzer

analyzer_config = ClassificationAnalyzerConfig(
    labels=category_list,
)

text_analyzer = ZeroShotClassificationAnalyzer(
    model_name_or_path="typeform/mobilebert-uncased-mnli",
    device="auto",
)

## Configure Pandas DataFrame Informer




In [None]:
from pandas import DataFrame
from obsei.sink.pandas_sink import PandasSink, PandasSinkConfig

sink_config = PandasSinkConfig(
   dataframe=DataFrame(),
   include_columns_list=included_cols
)
sink = PandasSink()

## Fetch app reviews

In [None]:
source_response_list = source.lookup(source_config)

## Pre-process review text to clean it



In [None]:
cleaner_response_list = text_cleaner.preprocess_input(
    input_list=source_response_list,
    config=text_cleaner_config
)



## Analyze reviews to perform classification
**Note**: This is a compute-heavy step

In [None]:
analyzer_response_list = text_analyzer.analyze_input(
    source_response_list=cleaner_response_list,
    analyzer_config=analyzer_config
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


## Inform review data in the form of a Pandas DataFrame

In [None]:
dataframe = sink.send_data(analyzer_response_list, sink_config)
dataframe.rename(columns=rename_cols_dict, inplace=True)

dataframe["brand"] = name
dataframe

Unnamed: 0,text,positive,easy payment options,easyOrder placement,user interface,Realtime order tracking,Rewards and discounts,social media Integration,negative,ratings,time,brand
0,good,1.00,0.67,0.65,0.60,0.43,0.35,0.06,0.00,5,2021-07-11 17:09:17,zomato
1,excellent loving,1.00,0.20,0.19,0.32,0.10,0.11,0.01,0.00,5,2021-07-11 17:08:09,zomato
2,delievered wrong house,0.00,0.00,0.00,0.26,0.00,0.02,0.03,0.99,1,2021-07-11 17:07:36,zomato
3,superb excellent,1.00,0.55,0.57,0.71,0.28,0.20,0.02,0.00,5,2021-07-11 17:07:17,zomato
4,good,1.00,0.67,0.65,0.60,0.43,0.35,0.06,0.00,4,2021-07-11 17:05:58,zomato
...,...,...,...,...,...,...,...,...,...,...,...,...
195,sellers cheat users selling less quantity cont...,0.18,0.00,0.04,0.07,0.04,0.08,0.03,0.68,1,2021-07-11 16:08:05,zomato
196,nice service,0.99,0.81,0.40,0.60,0.12,0.28,0.02,0.00,5,2021-07-11 16:07:52,zomato
197,amazing experience far,0.99,0.02,0.04,0.21,0.09,0.02,0.01,0.00,5,2021-07-11 16:07:53,zomato
198,delivery fast less offers cash delivery,0.94,0.94,0.17,0.62,0.13,0.06,0.03,0.42,2,2021-07-11 16:07:38,zomato
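
The scores in a row do not sum to 1 (row 198 scores 0.94 for both `positive` and `easy payment options`), so each label appears to be scored on its own. As a rough sketch of how a row might be read, the helper below lists the labels whose score reaches a threshold, highest first. The `labels_above` name and the 0.5 threshold are arbitrary choices for illustration.

```python
from typing import Dict, List


def labels_above(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return labels scoring at least `threshold`, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, score in ranked if score >= threshold]


# Scores copied from row 198 of the output above.
row_198 = {
    "positive": 0.94,
    "easy payment options": 0.94,
    "easyOrder placement": 0.17,
    "user interface": 0.62,
    "Realtime order tracking": 0.13,
    "Rewards and discounts": 0.06,
    "social media Integration": 0.03,
    "negative": 0.42,
}

top_labels = labels_above(row_198)
print(top_labels)
```

Raising or lowering the threshold trades precision for coverage; a suitable value depends on the model and the category list.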


## Store result in Google Drive as CSV

In [None]:
dataframe.to_csv(f'/content/drive/My Drive/playstore_{name}.csv')