# Sprint 4: Sprint Journal for Runjini Murthy

---
## DAY 1: Tuesday (Week 1)
### Project Narrative
In an effort to use real-world data from my job, I decided to focus my Sprint on Facebook.  Given its wide use, I thought it made sense to work on this project to get familiar with packages in R.  

Currently in my job, I pull organic post data in a very manual fashion.  Facebook is constantly changing their interface, and it's possible I haven't explored every facet, but for the time being, I have been copying and pasting insight data and manually converting it to an Excel spreadsheet.  I use this spreadsheet for further analysis.

This project allowed me to extract very similar data (though not everything; things like reach are missing) more quickly. The Rfacebook package provides commands to pull data from Facebook, but first a token must be created to log in to Facebook. A Facebook account is needed to access this information, including public pages.

Once the key and secret are found in the Developer area, they go into a stored token. This token is used with the "getPage" function in R to extract a data frame from Facebook. The data frame is then stored as a CSV file on the local computer, which solves my issue of quickly downloading Facebook data.
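
As a rough sketch of the save step, the snippet below writes a small stand-in data frame to CSV and reads it back. The `page_posts` data frame, its values, and the temporary file path are made up for illustration; in the real workflow the data frame would come from the "getPage" call.

```r
# Stand-in for the data frame returned by getPage (values are invented).
page_posts <- data.frame(
  created_time   = c("2017-05-01T18:30:00+0000", "2017-05-02T03:15:00+0000"),
  likes_count    = c(12, 40),
  comments_count = c(3, 5),
  shares_count   = c(1, 7),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV; swap in a real path to keep the file.
csv_path <- tempfile(fileext = ".csv")
write.csv(page_posts, csv_path, row.names = FALSE)

# Read it back to confirm the round trip.
restored <- read.csv(csv_path, stringsAsFactors = FALSE)
print(restored)
```

Setting `row.names = FALSE` keeps R from adding an extra index column that would otherwise show up when the file is opened in Excel.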

Beyond this, this sprint entailed analysis; namely, determining the average volume of engagement (likes, shares, comments) for every hour in a 24-hour cycle. The goal was to determine which posting hour produced the highest total engagement on Facebook.

This process made me realize I needed some way to parse the timestamp data I received from Facebook. Not seeing a native R function to do so (the timestamps had a "T" between the date and time), I installed the parsedate package.

Using the parse_iso_8601 function, I created a new column that parsed out the date and time, removing the "T" in between. After that, the format function was applied to pull out the month/day/year ```format(WVFBData$adjusted_timestamp, "%m/%d/%y")``` and ultimately the hour ```format(WVFBData$adjusted_timestamp, "%H")```.
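
For comparison, base R can also read this timestamp layout if the format string is spelled out, so the sketch below uses `as.POSIXct` rather than `parse_iso_8601`. The sample timestamps and the `+0000` suffix are assumptions about what the `created_time` column looks like, not values from the real data.

```r
# Sample timestamps; the exact layout of created_time is an assumption.
created_time <- c("2017-05-01T18:30:00+0000", "2017-05-02T03:15:00+0000")

# The literal "T" goes straight into the format string; %z reads the UTC offset.
adjusted_timestamp <- as.POSIXct(created_time,
                                 format = "%Y-%m-%dT%H:%M:%S%z",
                                 tz = "UTC")

# Pull out the date and the hour as text columns.
post_date <- format(adjusted_timestamp, "%m/%d/%y")
post_hour <- format(adjusted_timestamp, "%H")

print(data.frame(created_time, post_date, post_hour))
```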

Once these new columns were created, I set out to determine the mean engagement by hour. To start, I had to create a new column that summed likes, comments, and shares: ```WVFBData$total_engagement<-WVFBData$likes_count + WVFBData$comments_count + WVFBData$shares_count```

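Next, I performed simple calculations to determine the mean for a given hour, using the filter function from dplyr and the native mean function:
```r
library(dplyr)  # filter() comes from dplyr, not base R
# Keep only the posts made in the midnight hour, then average their likes.
twelveam <- filter(WVFBData, post_hour == "00")
mean(twelveam$likes_count)
```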
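
Filtering one hour at a time works, but the same means can be computed for every hour in a single call. The sketch below uses base R's `aggregate` on a made-up data frame; the column names follow the ones used in this journal, and the numbers are invented.

```r
# Invented sample data using the journal's column names.
WVFBData <- data.frame(
  post_hour        = c("14", "14", "17", "17", "19"),
  total_engagement = c(10, 20, 30, 50, 25),
  stringsAsFactors = FALSE
)

# Mean total engagement for each posting hour.
hourly_means <- aggregate(total_engagement ~ post_hour, data = WVFBData, FUN = mean)

# Sort so the best-performing hour comes first.
hourly_means <- hourly_means[order(-hourly_means$total_engagement), ]
print(hourly_means)
```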

Through some trial and error, I found out that my timestamps were actually in UTC, so they were ten hours ahead of Honolulu time. I went through the timestamp adjustments again, subtracting the difference in seconds, based on some reference links I found which explained that timestamps are stored as seconds elapsed since 1970. Subtracting 10*60*60 resulted in the correct time.
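
The fixed subtraction can be checked against R's own time zone handling. In this sketch the timestamp is invented, and `Pacific/Honolulu` is the tz database name for Hawaii time; because Hawaii does not observe daylight saving time, the two approaches should agree.

```r
# An invented UTC timestamp.
utc_time <- as.POSIXct("2017-05-02 03:15:00", tz = "UTC")

# Approach from the journal: shift by ten hours' worth of seconds.
shifted <- utc_time - 10 * 60 * 60
hour_by_subtraction <- format(shifted, "%H", tz = "UTC")

# Alternative: keep the same instant and display it in Honolulu time.
hour_by_timezone <- format(utc_time, "%H", tz = "Pacific/Honolulu")

print(c(subtraction = hour_by_subtraction, timezone = hour_by_timezone))
```

The time zone version has the advantage of also working for locations that do switch between standard and daylight time.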

My last goal was to plot this engagement by hour, which I hoped to do within the ggplot function. I ultimately landed on the correct call: ```ggplot(WVFBData) + geom_bar(aes(x=post_hour,y=total_engagement),stat="summary", fun.y = "mean",fill=I("grey50"))``` but initially missed the 'fun.y = "mean"' portion, which Ben found. Without that element, the plot was summing all the engagement for each hour, but not calculating a mean. After making this correction, I saw the highest engagement at 5pm, 2pm, and 7pm. This matched the quick analysis I performed in an Excel pivot table and chart, which I used to check the accuracy of my answers.
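
A self-contained version of that plot, on invented data, might look like the sketch below. One caveat: ggplot2 3.3.0 and later renamed `fun.y` to `fun`, so this sketch uses the newer argument and assumes a recent ggplot2 is installed.

```r
library(ggplot2)

# Invented sample data using the journal's column names.
WVFBData <- data.frame(
  post_hour        = c("14", "14", "17", "17", "19"),
  total_engagement = c(10, 20, 30, 50, 25)
)

# stat = "summary" with fun = "mean" draws one bar per hour at the mean height.
engagement_plot <- ggplot(WVFBData) +
  geom_bar(aes(x = post_hour, y = total_engagement),
           stat = "summary", fun = "mean", fill = "grey50")

print(engagement_plot)
```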

### What I Expect to Learn
* I expect to begin my foundational understanding of libraries and perform my first real sprint where I understand how it works.  I imported a pandas library in a previous sprint during a research phase, but did not delve into the details.  I wish to learn at least 5-10 functions/use cases for the library.

### Project References
#### Training/Research Links
- http://dataconomy.com/2015/03/14-best-python-pandas-features/: This seems quite helpful in describing key functionality in Pandas and how it applies.
- http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/
- http://pythonforengineers.com/introduction-to-pandas/: Intro to pandas
- http://pythonforengineers.com/machine-learning-for-complete-beginners/: Saving this link for later for machine learning.  Seems like a good tutorial here.
- https://www.analyticsvidhya.com/blog/2014/07/facebook-analyst/
- https://www.r-bloggers.com/analyze-facebook-with-r/
- https://research.fb.com/prophet-forecasting-at-scale/ 
- https://github.com/facebook/prophet 
- https://bigdataenthusiast.wordpress.com/2016/03/19/mining-facebook-data-using-r-facebook-api/
- https://www.datacamp.com/community/tutorials/r-packages-guide
- How to add a column using existing column data: http://rprogramming.net/r-data-manipulation/
- Parsedate package: https://github.com/gaborcsardi/parsedate
- Background on installing chron package: https://www.stat.berkeley.edu/~s133/dates.html
- Documentation on how to parse date time (assumes proper class for date format): https://stat.ethz.ch/pipermail/r-help/2005-October/080979.html
- Documentation on dplyr package: http://www.datacarpentry.org/R-genomics/04-dplyr.html
- Documentation on how to change time by using some addition/subtraction of seconds: http://www.dummies.com/programming/r/how-to-format-and-perform-operations-on-dates-and-times-in-r/
- Including mean in the ggplot call: https://stackoverflow.com/questions/30183199/ggplot2-plot-mean-with-geom-bar

#### Sample Projects
- https://blog.kissmetrics.com/how-airbnb-uses-data-science/ - AirBnb product/marketing case study
- http://inseaddataanalytics.github.io/INSEADAnalytics/BoatsSegmentationCaseSlides.pdf 
- http://www.skampakis.com/data-science-marketing-tool-online-ads/ - Online ads
- https://www.r-bloggers.com/data-analysis-for-marketing-research-with-r-language-1/ - Sample R project


### Project Pitch
- Analyze marketing data set using pandas.
- Per the suggestion of Nat, I will either use a marketing data set from a current project at work, or find something similar from Kaggle. Then, I will go through a trial process of reviewing each function in Pandas to see what it does and how it works.
- Skill Story: As a budding marketing data analyst, I need to understand how to utilize libraries/packages so that I have alternative methods to perform social media campaign analysis other than Excel.  


### Other Notes
- Pandas depends on the NumPy library.
- Learned that by adding a file name to the .gitignore file, you can prevent that file from being pushed to your repo.
- The .gitignore file is edited in Atom or another text editor.
- The .gitignore file is hidden by default, so I couldn't navigate to it in the usual way. I did a save and replace on a whim in Atom and it seems to have worked.

--- 
## DAY 2: Wednesday (Week 1)
### Pitch Feedback 
*These are comments I received or ideas I had from people's feedback to my pitch*

### Prototype Notes
Needed a bit of a refresher for work in R.  Reviewed functions like min, max, subset, and storing files to an easier value to work with.

Learned that installing a package is not enough; you also have to load it into the session with the "library" function.
``install.packages("Rfacebook")
library("Rfacebook") ``

Had to go through the process of getting token access through the Facebook developer page. 

``myoauth <- fbOAuth(app_id="Facebook-ID", app_secret="Facebook-secret")``

Saved the token for easy access. 

``save(myoauth, file="myoauth")
load("myoauth")
me <- getUsers("me", token = myoauth)
me$name``

``Runj Ward [i.e. this is the result at the me$name command]``
 
Learned how to access public page data with the getPage function. You need to determine the ID of the public page from this website: https://findmyfbid.com/. The Ward Village ID is 310668279073583.

One function to review content: with n = 10, getPage returns the first ten rows of public page data.

``getPage (310668279073583, token = myoauth, n = 10) ``

Outstanding questions: 
 * How do I filter for 2017 posts only?  (A subset function off of created_time is running into issues with the T00:00:00 text.)  
 * How can I bucket the posts into certain groups of content to make conclusions about what content performs well? 
 * How can I add paid post/paid ad data? 
 * What are the top performing pieces of content by type? 
 * How do I take this project to the next level?
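
For the first question, one possible approach is to parse `created_time` into a real date-time first and then compare on the year. This is only a sketch on invented timestamps, and it assumes the column uses the `YYYY-MM-DDTHH:MM:SS+0000` layout; the `posts` data frame and `parsed_time` column are hypothetical names.

```r
# Invented posts; the timestamp layout is an assumption.
posts <- data.frame(
  created_time = c("2016-12-31T23:00:00+0000",
                   "2017-03-15T20:00:00+0000",
                   "2017-11-02T01:30:00+0000"),
  likes_count  = c(5, 12, 8),
  stringsAsFactors = FALSE
)

# Parse the text into date-times so the "T" no longer gets in the way.
posts$parsed_time <- as.POSIXct(posts$created_time,
                                format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC")

# Keep only rows whose year is 2017.
posts_2017 <- subset(posts, format(parsed_time, "%Y") == "2017")
print(posts_2017)
```

The year here is taken in UTC, so a post made late on December 31 in Honolulu time could land in the following year unless the timestamps are shifted first.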

### Pair Show & Tell Comments
*Comments from prototype discussion*

*These comments lead into plan development. Key considerations for me and my partner:  Do I have a plan? Is my plan feasible?*

### Additional Links
1. https://stackoverflow.com/questions/24719489/failure-to-get-fboauth-in-rfacebook - Instructions for how to set up the localhost URL within the Facebook developer area.
2. Instructions for Rfacebook package: https://bigdataenthusiast.wordpress.com/2016/03/19/mining-facebook-data-using-r-facebook-api/
3. Instructions on why fboauth didn't work - need to install packages and then run it as a library: https://stackoverflow.com/questions/27682059/rfacebook-could-not-find-function-fboauth
4. Read and write links: http://rprogramming.net/read-csv-in-r/ and http://www.instantr.com/2012/12/11/exporting-a-dataset-from-r/

### Proposed Plan: Key Milestones by Day

##### Day 2 (Wed Week 1):
- Develop Project Proposal
- Push Docs / Repo / Roadmap Update

##### Day 3 (Thu Week 1)
- Finish analysis (i.e. to answer marketing questions on optimal post time and what content drives engagement) -- Only got to the optimal post time question, but was a very useful exercise
- Researching additional libraries: parsedate, chron, ggplot2 -- parsedate allowed me to separate out the date/timestamp
- Create program in R -- still need to fix this, but on Ben's advice, aiming to create a mini-program that can run
- Push docs / repo

##### Day 4 (Tue Week 2): 
- Figure out means by post hour within the ggplot function
- Streamline program to most relevant pieces of code
- Push docs / repo

##### Day 5 (Wed Week 2):
- Project highlights
- Identify question or cohort knowledge gap for sprint review
- Develop Topic Project + Presentation
- Push Repo / docs / Presentation

### Project Definition and README.MD Discussion 
*This is a discussion of how this project will fit into my overall roadmap. I will update my roadmap with the following project definition*

*I will focus my project Repo's README.MD on the same topic, but with this additional detail.*


In [1]:
install.packages(“Rfacebook”)
library("Rfacebook") 

myoauth <- fbOAuth(app_id="Facebook-ID", app-secret="Facebook-secret")

save(myouath, file="myoauth")
load("myoauth")
me <- getUsers("me", token = myoauth)
me$name

ERROR: Error in parse(text = x, srcfile = src): <text>:1:18: unexpected input
1: install.packages(<e2>
                     ^

The `<e2>` in this error is the first byte of the curly opening quote around Rfacebook; R only accepts straight quotes as string delimiters.


--- 
## DAY 3: Thursday (Week 1)

#### Setup for Repo and Documentation Push
*Setup and testing I did to make sure my repo and documentation were ready to push at the end of the day*

#### Repo File Strategy Discussion
*How I will present my repo files for clarity and demonstration*

#### Work towards analysis
- Ultimately focused on optimal post time.  Needed to first start with libraries to parse the timestamp.  Once that was done, I performed some basic calculations on mean by post hour to ensure I could calculate it.  I also duplicated this work using an Excel pivot table to have something to measure it against.

#### Additional R packages
- Reviewed parsedate, chron, ggplot2

#### Create program in R
- This is set up but still has a lot of one-off calls (for validation purposes).  Needs to be streamlined so it could serve as code someone could run.


---
## DAY 4: Tuesday (Week 2)

#### Work in Progress Feedback 
*Feedback and ideas from my work in progress presentation*

#### ggplot + grouping post hour by means
- Hit major roadblocks here until Ben shared this article to help: https://stackoverflow.com/questions/30183199/ggplot2-plot-mean-with-geom-bar

#### Streamline R script
- Done; edited out unnecessary code


--- 

## DAY 5: Wednesday (Week 2)

### Project Highlights: The things I am most excited about in my project
Me
- Automated process to download post data from Facebook.
- Token process
- Parsing timestamp
- How to use the mean function in ggplot

Peer Identified

- Michael mentioned the notion of positive/negative sentiment when it comes to comments, which is another good idea for future projects.
- The default timestamp for the Facebook posts is probably UTC, since it was ten hours ahead of the actual HST post time.

### Peer Repo Feedback - From Ben

- Run snippets of code in separate cells.
- Figure out how to run R in Jupyter Notebook.
- Include narrative at top of journal to summarize process for project.
- Streamline repo folder to have succinct README, PPT presentation, and script as main files in the repo folder. Any other images/files can be placed in an "Other Files" subfolder.
- For local computer, set up parallel Sprint folders to store sensitive data (i.e. code with full Facebook ID and secret).


--- 
## DAY 6: Thursday (Week 2)

#### Things I didn't get to
- Using a facet on the date (not time) parameter to show multiple subset plots for different quarters. This would get at seasonality and show how time of year might influence post hour engagement.
- How do I filter for 2017 posts only?
- How can I bucket the posts into certain groups of content to make conclusions about what content performs well?
- How do I do sentiment analysis on the engagement we were receiving (i.e. positive or negative)? Can this be filtered out?
- How can I add paid post/paid ad data?
- What are the top performing pieces of content by type?
- Figuring out how to run R snippet in Jupyter Notebook.
