## Breaking the Problem Down into Smaller Pieces
Here I outline the project and break it into more digestible pieces of work, down to the task level. This gives me a better understanding of the required resources and deadlines.

I will distill each task down to its first steps.

Ideally, I will get down to 10-minute intervals.

1. Data Collection
    - Data sources
        - Amazon reviews
            - [AWS Glue](https://aws.amazon.com/glue/)
                - Using a [Crawler](https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html)
        - App store reviews
            - Apple
                - May not allow scraping
            - Android
        - G2 software reviews
        - Reddit
        - Twitter
    - Features
        - Review text
        - Stars
        - Product Name
        - Brand
        - Date
        - User Name
        - Verified purchase?
        - Picture(s) of item
        - Relevance
            - "Helpful" status of review (Amazon)
            - Thumbs Ups (Android App Store)
            - Upvotes (Reddit)
            - Likes (Twitter)
    - Amount of Data
        - Will shoot for a range of 1-3 products
            - 1 hardware
            - 1 software established 5+ years
                - Snapchat (20M reviews)
                - WhatsApp (97M reviews)
                - PicsArt Photo Editor (8M reviews)
                - Robinhood (98,000 reviews)
            - 1 upstart software (< 3 years old, but > 2,000 reviews)
                - Disney+
                    - 12k reviews, but just released
                    - Version 1.1.3, so not a lot of updates to use
                - Cake Web Browser
                    - 98k reviews
                    - Released in 2018
                    - Version 5.1.02
                - Google Assistant
                    - 157k reviews
                    - Hasn't been updated since March 2018, though
                
        - 5,000+ reviews each
            - May start with 1,000 if 5,000 each is too slow.
        - Find APIs
            - Must be able to collect 1,000+ reviews for an item
            - May need to use various APIs for different data sources
                - Amazon
                - G2
                - Twitter
                - Reddit
                - App stores
        - Combine data into a dataframe
    - Look for a pre-categorized data set of reviews.
            
        
2. Data Cleaning
    - Filter out short reviews (under 5-10 words).
    - Null Values
    - Missing fields (considering variety of sources)
    - Filter out spammy reviews
        - Train an NLP model on Amazon's [non-compliant reviews dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html)
3. EDA
    - Length of reviews
    - Star ratings
    - Sentiment analysis
    - Bar chart comparing data sources
        - Number of reviews
        - Average star rating
    - Cluster chart of reviews
    - Remove Duplicates
    - Filter out fake reviews if possible
    - Consider filtering down to users with more than 5 reviews
    - Determine, of the actionable reviews, if subcategorization would be useful
        - UX/UI
        - Delivery
        - Bugs
        - Quality
        - Material
        - Functional flaw
        - Design
        - etc.
4. Pre-Processing and Feature Engineering
    - NLP
        - Sentiment analysis
        - Top grams, bigrams, words
            - Remove stop words
        - Sentence tokenizer
        - Multi-label Naive Bayes classifier
            - May want to use [label powerset transformation](http://scikit.ml/userguide.html)
    - Word2Vec
        - Find actionable and insightful words and similar words
        - Will help to generalize terms by including similar words to the insight lexicon
    - spaCy
        - Identify those words in sentences
        - Dependency parser can help with categorizing insights
        - Can vectorize sentences to be used in a clustering model to potentially determine insight categories.
    - NeuralCoref v2.0
        - Pre-trained neural network
    - VADER
        - Provide sentiment analysis with the aid of the star rating
    - Remove stop words from 
    - Categorize the actionable insight
        - TBD 
5. Model Tuning
    - Cluster model
    - Convolutional neural network
        - [Swarm algorithms](https://www.sciencedirect.com/topics/engineering/swarm-intelligence-algorithm)
        - Bayesian hyperparameter optimization
        - [Keras grid search](https://keras-team.github.io/keras-tuner/)
6. Production Model and Insights
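
The cleaning steps in item 2 (null values, short reviews) and the duplicate removal listed under EDA can be sketched with the standard library alone. This is only a minimal sketch: the `Review` fields, the source labels, and the 5-word cutoff (the low end of the 5-10 word range above) are assumptions for illustration, not the final schema.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Review:
    source: str              # hypothetical label, e.g. "amazon", "g2", "android"
    text: Optional[str]      # review body; may be missing for some sources
    stars: Optional[float]   # star rating; may be missing for some sources


def clean_reviews(reviews: Iterable[Review], min_words: int = 5) -> List[Review]:
    """Drop reviews with missing text, too few words, or duplicated text."""
    seen_texts = set()
    kept = []
    for review in reviews:
        if review.text is None or not review.text.strip():
            continue  # null or empty review text
        # Normalize case and whitespace so trivially different copies match.
        normalized = " ".join(review.text.lower().split())
        if len(normalized.split()) < min_words:
            continue  # below the assumed minimum word count
        if normalized in seen_texts:
            continue  # duplicate of a review that was already kept
        seen_texts.add(normalized)
        kept.append(review)
    return kept


# Made-up sample rows, one per case the function handles.
sample = [
    Review("amazon", "Battery died after two weeks of light use.", 2.0),
    Review("amazon", "battery died after  two weeks of light use.", 2.0),
    Review("g2", "Great!", 5.0),
    Review("android", None, 1.0),
]
cleaned = clean_reviews(sample)
print(len(cleaned))
```

Duplicates are matched on normalized text regardless of source, which may be too aggressive if the same short phrase legitimately appears on several platforms.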

## Concepts and Aspects I would like to Include
- Swarm Algorithm
- Bayesian hyperparameter optimization
- Word2Vec
- spaCy
    - Custom word list
    - Sentence tokenization 
- Lexicon of actionable insight words
    - May derive this from neural network
- App store reviews, maybe from Android
    - As training to see which suggestions from comments were incorporated into product updates.
- G2 reviews for software
    - As training to see which suggestions from comments were incorporated into product updates.
    - Determine negative reactions after updates to predict future reactions to features and updates
- Computer vision
    - Collect feedback on same items using picture, compare feedback
    - Get feedback from similar items
        - can combine with genre or categorization
- Neural Network
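
One simple way to look for negative reactions after an update is to compare the average star rating before and after a release date. The sketch below assumes each review carries the Date and Stars features listed earlier; the function name, the sample ratings, and the update date are all made up for illustration.

```python
from datetime import date
from statistics import mean
from typing import List, Optional, Tuple

# Each review is (review_date, stars).
Rating = Tuple[date, float]


def rating_shift(reviews: List[Rating], update_day: date) -> Optional[float]:
    """Mean stars on or after update_day minus mean stars before it.

    Returns None when either side of the update has no reviews.
    """
    before = [stars for day, stars in reviews if day < update_day]
    after = [stars for day, stars in reviews if day >= update_day]
    if not before or not after:
        return None
    return mean(after) - mean(before)


# Made-up ratings around a hypothetical update on 2019-11-01.
reviews = [
    (date(2019, 10, 1), 4.0),
    (date(2019, 10, 20), 5.0),
    (date(2019, 11, 5), 2.0),
    (date(2019, 11, 9), 3.0),
]
shift = rating_shift(reviews, update_day=date(2019, 11, 1))
print(shift)  # a negative value means ratings dropped after the update
```

A raw difference in means ignores review volume and unrelated events, so it is at best a first signal to check before training anything on update reactions.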

### What am I Predicting?
- If a comment is categorized as containing actionable insights
    - Extracting that sentence and returning it.
- Predicted feedback based on similar items
- Predicting similar items based on photos
- What could be the impact of a future feature change
    - Train [ReAgent](https://github.com/facebookresearch/ReAgent) on updates and reactions in reviews to determine what consequences could be for a feature or change
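
As a baseline for the first prediction (flagging a comment as actionable and returning the relevant sentence), a lexicon lookup over sentences may be enough to start with. The sketch below is an assumption-heavy stand-in: the seed lexicon is hypothetical, and the regex splitter only approximates what spaCy's sentence segmentation would do.

```python
import re
from typing import List, Set

# Hypothetical seed lexicon; the plan is to derive the real one later
# (for example from Word2Vec neighbors or a neural network).
ACTIONABLE_WORDS: Set[str] = {"should", "wish", "needs", "fix", "add", "crashes"}

# Naive splitter: break after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def actionable_sentences(review: str, lexicon: Set[str] = ACTIONABLE_WORDS) -> List[str]:
    """Return the sentences of a review that contain at least one lexicon word."""
    hits = []
    for sentence in _SENTENCE_END.split(review.strip()):
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        if words & lexicon:
            hits.append(sentence)
    return hits


review = "Love the filters. The app crashes when I upload a video! Please add dark mode."
print(actionable_sentences(review))
```

Exact word matching misses inflections such as "crashed" or "adding", which is one reason to expand the lexicon with similar words later.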

## Week 3 Checkin

<details>
    <summary>Do you have data fully in hand and if not, what blockers are you facing?</summary>
- Sort of. I'm working on getting an AWS Glue query set up and trying to determine whether it would be expensive, considering the size of the potentially available dataset.
- I have 6 million reviews of electronics as a downloaded dataset to work with.
- I have 3,000 reviews of HubSpot from G2 that I am currently working with.
- I have 5,000 reviews of Snapchat from the Android app store downloaded, but I have not scraped them yet and am having difficulty extracting the data.</details>

<details>
    <summary>Have you done a full EDA on all of your data?</summary>
- Have looked at nulls
- Content looks useable
- Still need to do NLP</details>

<details>
    <summary>Have you begun the modeling process? How accurate are your predictions so far?</summary>
- Not yet</details>

<details>
    <summary>What blockers are you facing, including processing power, data acquisition, modeling difficulties, data cleaning, etc.? How can we help you overcome those challenges?</summary>
- Accessing Amazon data: I have one dataset and would like more through AWS access.
- Scraping the app store has been difficult, so I am not sure I will have this feature (app updates) as validation of actionable insights.
- Undecided on whether to go deep or broad with the analysis with regard to products
    - Should I produce generally useful information or something very specific to one vertical?
        - software
        - electronics
        - physical products
        - small subset of products (3 products with many reviews?)
    - The largest dataset is from Amazon, which does not have much software outside of video games.</details>
    

<details>
    <summary>Have you changed topics since your lightning talk? Since you submitted your Problem Statement and EDA? If so, do you have the necessary data in hand (and the requisite EDA completed) to continue moving forward?</summary>
- I changed topics after the lightning talk, but the topic is consistent with our last meeting.
- EDA is lagging but in sight now that I have 2 datasets to work with.</details>

<details>
    <summary>What is your timeline for the next week and a half? What do you have to get done versus what would you like to get done?</summary>
- Do NLP on existing datasets
- Use CNN on reviews</details>

<details>
    <summary>What topics do you want to discuss during your 1:1?</summary>
- AWS and accessing portions of large datasets, SQL?</details>
