Setup post and basescraper with QuoraProfileScraper #1249

vibhcool · 2017-06-15T19:09:51Z

This PR is a follow-up to PR #1250.
These are the changes I have made so far:

Modify abstract class BaseScraper that will be inherited by all scrapers
Configure QuoraProfileScraper with BaseScraper and Post : Related Issue Add search scraping in QuoraProfileScraper #1230
Add Timeline2 class : Related Issue Add Scrapers to SearchServlet #1231 , Refactor code of TimeLine.java #1244
Timeline iterates over MessageEntry objects. A lot of classes depend on Timeline, so it can't be reconfigured to iterate over Post objects. That is why I have created the Timeline2 class, to be used until TwitterScraper, which currently works independently, is configured with Post and BaseScraper.
Configure Timeline2 with QuoraProfileScraper : Related Issue Add Scrapers to SearchServlet #1231
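
The relationship described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class names carry a `Sketch` suffix, and the method names `prepareUrl`, `parse`, `fetch` and `scrape`, as well as the Quora profile URL prefix, are assumptions made for the example. Fetching is stubbed so the sketch runs offline.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Post: a key/value record produced by a scraper.
class PostSketch {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    PostSketch put(String key, Object value) {
        fields.put(key, value);
        return this;
    }

    Object get(String key) {
        return fields.get(key);
    }
}

// Hypothetical stand-in for BaseScraper: subclasses supply URL building and parsing.
abstract class BaseScraperSketch {
    protected abstract String prepareUrl(String query);

    protected abstract List<PostSketch> parse(String html, String query);

    // Stubbed fetch (assumption): returns placeholder HTML instead of doing network I/O.
    protected String fetch(String url) {
        return "<html>stub for " + url + "</html>";
    }

    // Template method shared by all scrapers.
    List<PostSketch> scrape(String query) {
        String url = prepareUrl(query);
        return parse(fetch(url), query);
    }
}

// Hypothetical Quora profile scraper built on the base class.
class QuoraProfileScraperSketch extends BaseScraperSketch {
    @Override
    protected String prepareUrl(String query) {
        // Assumed URL layout, for illustration only.
        return "https://www.quora.com/profile/" + query;
    }

    @Override
    protected List<PostSketch> parse(String html, String query) {
        List<PostSketch> posts = new ArrayList<>();
        posts.add(new PostSketch().put("profile", query).put("raw_length", html.length()));
        return posts;
    }
}
```

With this shape, a new scraper only has to implement the two abstract methods; everything else is inherited.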

To test: http://127.0.0.1:9000/api/quoraprofilescraper?query=Vibhor-Verma-5

EDIT 1: I have added TODOs where I still have to make changes.

EDIT 2:
TODOs refer to:-

The lines that use Timeline2 in a signature. These signatures need to be fixed once Timeline is replaced by the Timeline2 class. Timeline2 is a temporary class.
Dummy variables: the dummy variables store the parameters that can be fetched. I have used them for now; later the GET parameters have to be fetched as extra parameters.

EDIT 3: Run a diff of Timeline.java and Timeline2.java. The two are very much alike.

Short description

I have:

There is a corresponding issue for this pull request.
Mentioned the issue number in the pull request commit message (Fixes #<number>).
There is strictly only one commit per issue.

For the reviewers

I have:

Reviewed this pull request by an authorized contributor.
The reviewer is assigned to the pull request.

vibhcool · 2017-06-16T04:51:37Z

@Orbiter @sudheesh001 @daminisatya @jig08 @mariobehling @singhpratyush @kavithaenair @SKrPl @hemantjadon @Achint08 @sarishinohara @djmgit
Please review the changes.
This is a follow-up to PR #1250.

kavithaenair

Check the import statements and put them in lexical order.

kavithaenair · 2017-06-16T09:17:36Z

src/org/loklak/harvester/BaseScraper.java

+import java.lang.StringBuilder;
+import org.loklak.server.AbstractAPIHandler;
+import org.loklak.data.DAO;
+import java.net.URL;


Follow lexical order here as well.
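
For reference, lexical ordering of the four imports quoted above can be checked with a small helper. This is only an illustrative sketch; the class and method names are made up, and the project's actual checkstyle rules may group imports differently.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: sorts import statements in plain lexical (String) order.
class ImportOrderSketch {
    static List<String> sortImports(List<String> imports) {
        return imports.stream().sorted().collect(Collectors.toList());
    }
}
```

Applied to the imports in the diff, this puts `java.lang.StringBuilder` first, then `java.net.URL`, then `org.loklak.data.DAO`, then `org.loklak.server.AbstractAPIHandler`.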

kavithaenair · 2017-06-16T09:18:04Z

src/org/loklak/harvester/Post.java

+
+import org.json.JSONObject;
+import org.loklak.data.DAO;
+import java.net.URL;


kavithaenair · 2017-06-16T09:19:02Z

src/org/loklak/harvester/Post.java

+    public void setPostId() { }
+
+    //TODO: Set up TwitterTweet before setting this as abstract
+    public String getPostId() { return new String(); }


This doesn't look neat. Can you follow the style used on line 44?

sudheesh001 · 2017-06-17T12:57:49Z

src/org/json/JSONObject.java

        }
        Object object = this.opt(key);
        if (object == null) {
-            throw new JSONException("JSONObject[" + quote(key) + "] not found.");
+            //throw new JSONException("JSONObject[" + quote(key) + "] not found.");
        }


These are from the org/json library. Edits to this are really not recommended.
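
Instead of commenting out the exception inside the vendored library, the caller can use a lenient accessor. org.json's JSONObject already offers `opt` and `optString`, which return null or a default rather than throwing. The sketch below illustrates the strict-versus-lenient contract with a Map-backed stand-in so that it runs without the library; the class itself is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical Map-backed stand-in that mimics the get/opt split of a JSON object.
class LookupSketch {
    private final Map<String, Object> map = new HashMap<>();

    void put(String key, Object value) {
        map.put(key, value);
    }

    // Strict accessor: a missing key is an error, as in the original library code.
    Object get(String key) {
        Object value = map.get(key);
        if (value == null) {
            throw new IllegalStateException("[" + key + "] not found.");
        }
        return value;
    }

    // Lenient accessor: a missing key yields the caller-supplied default.
    String optString(String key, String defaultValue) {
        Object value = map.get(key);
        return value == null ? defaultValue : value.toString();
    }
}
```

Callers that expect a key to be absent pick the lenient accessor; the strict one keeps its contract for everyone else.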

sudheesh001 · 2017-06-17T12:58:25Z

src/org/loklak/api/search/ConsoleService.java

        });
+        */


Any reason why this is entirely commented out?

sudheesh001 · 2017-06-17T13:00:14Z

Please provide a test link.

hemantjadon

@vibhcool I am not really sure about this approach in BaseScraper, where the URL is provided in many chunks (baseUrl, midUrl, query, extra). I am not sure this helps in any way when instantiating the instance. IMO it will create confusion for a dev who wants to extend this class to modify something. This is tricky, because a URL is inherently one complete string with only 4 parts: the protocol (http/https), the domain (in some cases with a subdomain), the remaining part that is the entire path, and some query params. Splitting the URL according to our own terminology might create confusion. Is there a solid reason to do so?

vibhcool · 2017-06-20T09:11:20Z

@hemantjadon, I have used baseUrl to point to the website.
midUrl points to the part (or webpage) of the website we want to scrape.
query is used by various methods in the program.
I planned to fetch the extra parameters from the request body instead of from the URL string. I have declared extra but not used it so far.
I will add all parameters to the request body, as suggested by @singhpratyush, and keep the URL as a string made of baseUrl and midUrl.
Thus, I am trying to keep everything hassle-free :)
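
A minimal sketch of the URL composition described in this reply. The class name, the `prepareUrl` method and the example Quora values are assumptions for illustration; only the baseUrl/midUrl/query split comes from the discussion.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical sketch: the target URL is baseUrl + midUrl, with the query appended.
class ScraperUrlSketch {
    private final String baseUrl; // the website, e.g. "https://www.quora.com/" (assumed)
    private final String midUrl;  // the part of the site to scrape, e.g. "profile/" (assumed)

    ScraperUrlSketch(String baseUrl, String midUrl) {
        this.baseUrl = baseUrl;
        this.midUrl = midUrl;
    }

    String prepareUrl(String query) {
        try {
            // URLEncoder applies form encoding (a space becomes '+'); this is fine
            // for simple slugs, but a real path segment may need stricter handling.
            return baseUrl + midUrl + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

Keeping the pieces as fields lets other methods reuse `query` on its own, which is the motivation given above; the cost is that readers must learn the project-specific names.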

vibhcool · 2017-06-22T17:56:42Z

@sudheesh001 made all suggested changes, added test link :)
@kavithaenair fixed all syntax errors and codacy errors (some will be fixed in next PRs) :)
Please re-review

Achint08

LGTM 👍

singhpratyush · 2017-06-23T19:47:13Z

src/org/loklak/harvester/Post.java

+        return new Date(this.timestamp);
+    }
+
+    //TODO: Set up TwitterTweet before setting this as abstract


I have inherited Post here -> https://github.com/loklak/loklak_server/pull/1249/files#diff-c7c5000d9cf0abf8e64f6de0914d35d3
If I declare the methods abstract in Post right now, every class that inherits AbstractObjectEntry will have to define them. And not all of those child classes are scrapers.

not all of those child classes are scrapers

But they can still have a unique identifier.

They don't seem to be needed right now. Also, some of those classes are unrelated to what I am trying to achieve. That is why I have made this arrangement: just to make TwitterTweet a subclass of Post and to get results at the top. 😅
See PR #1277
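
The trade-off being debated can be illustrated with a small hierarchy. All class names below are hypothetical; the point is only that a concrete default in the base class lets unrelated subclasses compile unchanged, whereas an abstract method would force every subclass to implement it.

```java
// Hypothetical base class with a concrete default instead of an abstract method.
abstract class EntrySketch {
    // Default implementation: subclasses that have no identifier inherit this.
    String getPostId() {
        return "";
    }
}

// A non-scraper subclass: compiles without defining getPostId().
class AccountEntrySketch extends EntrySketch {
}

// A scraper result: overrides the default with a real identifier.
class ScrapedEntrySketch extends EntrySketch {
    private final String id;

    ScrapedEntrySketch(String id) {
        this.id = id;
    }

    @Override
    String getPostId() {
        return id;
    }
}
```

If `getPostId()` were declared abstract, `AccountEntrySketch` would fail to compile until it supplied an implementation, which is the ripple effect the author wants to postpone.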

Achint08 · 2017-06-23T21:53:11Z

src/org/loklak/objects/Timeline2.java

+
+    public static enum Order {
+        CREATED_AT("date"),
+        TIMESTAMP("long"),


Just wanted to know why TIMESTAMP is assigned the long data type.

@Achint08 The timestamp is a 64-bit integer, that's why. :)

The timestamp is the number of seconds since 1 Jan 1970. It is used for creating the PostId, and one can create a date from it.

Ok ok cool.
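
A small sketch of how a long timestamp maps to a date in Java. Two facts are worth keeping in mind: Java's `long` is a signed 64-bit type, and the `java.util.Date(long)` constructor interprets its argument as milliseconds since the epoch, so a value held in seconds has to be converted first. The helper class is hypothetical.

```java
import java.time.Instant;
import java.util.Date;

// Hypothetical helpers showing the two common epoch units.
class TimestampSketch {
    // Date(long) expects milliseconds since 1970-01-01T00:00:00Z.
    static Date fromEpochMillis(long millis) {
        return new Date(millis);
    }

    // A value in seconds must be converted before building a Date.
    static Date fromEpochSeconds(long seconds) {
        return Date.from(Instant.ofEpochSecond(seconds));
    }
}
```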

singhpratyush · 2017-06-28T16:04:29Z

src/org/loklak/objects/Timeline2.java

@@ -0,0 +1,457 @@
+/**


@vibhcool: Could you please point out the things in this class that are different from Timeline?

This is the Timeline class; the only difference is that it iterates over Post objects instead of MessageEntry objects.
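
Since the two classes differ only in their element type, one way to picture the eventual merge is a timeline parameterised over that type. This is an illustrative sketch under that assumption, not what the PR implements; the class name and methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical generic timeline: the element type is a parameter, so the same
// class could hold MessageEntry-like or Post-like objects.
class TimelineSketch<T> implements Iterable<T> {
    private final List<T> entries = new ArrayList<>();

    void add(T entry) {
        entries.add(entry);
    }

    int size() {
        return entries.size();
    }

    @Override
    public Iterator<T> iterator() {
        return entries.iterator();
    }
}
```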

singhpratyush · 2017-06-28T16:07:53Z

src/org/loklak/objects/Timeline2.java

+    }    
+
+    //TODO: this passes Timeline as argument
+    public void writeToIndex() {


If the plan is to write things to ES index, I think that Timeline should also define the index name.

Also, as the number of scrapers increases, the initialisation of ES node should automatically create indices. How can this be handled?

I just commented out this part. It isn't relevant now but may be needed once I get several scrapers working together. I added a TODO to work on this in another PR. This is pre-existing code from Timeline.java.

That is what I am talking about. Once you get more scrapers into work, you'll need to generalise the class for them. One of the requirements for it would be defining the index in which the data would need to get pushed.

@singhpratyush @sudheesh001, about the structure of the ES index: the existing structure can work with multiple scrapers. Some changes that I think are needed:

Presently there is an IndexEntry object that acts as the interface between Elasticsearch and the scraper system. The superclass of its generic type needs to be changed to Post.

There can be 2 approaches: an index per scraper or a document per scraper. The existing structure can work for both. A few lines of code will need to be added to set this up.

The 2nd point needs some discussion. I will create an issue for it. :)

@singhpratyush @sudheesh001 see this #1290

singhpratyush · 2017-06-28T16:12:20Z

@vibhcool: The test link is not operational. Please check.

vibhcool · 2017-06-28T19:23:16Z

@singhpratyush updated the test link, please re-review :)

SKrPl · 2017-06-29T19:08:46Z

@vibhcool Please look into codacy issues.

vibhcool · 2017-06-29T19:32:49Z

@SKrPl I have fixed the codacy issues. The 2 issues that still show up need to stay until TwitterScraper is refactored. 😅

kavithaenair

Rebase the branch. Rest LGTM 👍

sudheesh001 · 2017-06-30T05:51:48Z

I like the updates in this pull request. However, I agree with @singhpratyush's suggestion that writing to the index should be a separate class, so that as more scrapers are added there is a single general interface that can be used to push to the local Elasticsearch index.
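
The "single general interface" idea can be sketched as follows. Everything here is hypothetical, including the interface name and the in-memory implementation that stands in for Elasticsearch; it only shows how each scraper could name its target index while sharing one write path.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical write interface shared by all scrapers.
interface IndexSinkSketch {
    void write(String indexName, Map<String, Object> document);
}

// In-memory stand-in for the search index, used so the sketch runs without a cluster.
class InMemorySinkSketch implements IndexSinkSketch {
    final Map<String, List<Map<String, Object>>> indices = new HashMap<>();

    @Override
    public void write(String indexName, Map<String, Object> document) {
        // Create the index on first use, mirroring the "indices are created
        // automatically as scrapers are added" requirement raised in the review.
        indices.computeIfAbsent(indexName, name -> new ArrayList<>()).add(document);
    }
}
```

A scraper would then pass its own index name (for example one per scraper) without knowing how the sink stores the data.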

setup post and basescraper

singhpratyush · 2017-07-04T02:55:44Z

src/org/loklak/harvester/Post.java

+    }
+
+    //TODO: Set up TwitterTweet before setting this as abstract
+    public void setPostId() { }


No parameter in setter method?

https://dzone.com/articles/getter-setter-use-or-not-use

Go through the TODO mentioned; the method is to be made abstract.

Even if you set it to abstract, it should take some argument to which it can set the ID to.

I still don't understand how setting this abstract now will not work but setting it abstract after introducing TwitterScraper will work. If you don't have a unique identifier for QuoraScraper now, how will you produce it when TwitterScraper is introduced?

And I still think that there can be a unique identifier for each Quora profile that is scraped.

made the changes
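
For illustration, a setter that takes the value it sets, plus one possible way to derive an identifier. The class name, the `buildId` helper and the id format are assumptions; the thread does not say how the PostId is actually built beyond mentioning that the timestamp is used.

```java
// Hypothetical sketch of a post identifier with a conventional getter/setter pair.
class PostIdSketch {
    private String postId = "";

    // The setter receives the value it is supposed to store.
    void setPostId(String postId) {
        if (postId == null) {
            throw new IllegalArgumentException("postId must not be null");
        }
        this.postId = postId;
    }

    String getPostId() {
        return postId;
    }

    // Assumed id scheme: scraper name, query and timestamp joined by underscores.
    static String buildId(String scraper, String query, long timestamp) {
        return scraper + "_" + query + "_" + timestamp;
    }
}
```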

singhpratyush · 2017-07-04T02:56:02Z

src/org/loklak/harvester/Post.java

-    //public abstract String setPostId();
+    //TODO: Set up TwitterTweet before setting this as abstract
+    public String getPostId() {
+        return new String();


Better to return "";.

It doesn't matter; see the TODO mentioned.

made the changes
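
The difference behind this suggestion, as a short sketch: `new String()` and `""` are equal in content, but the constructor allocates a fresh object on every call while the literal refers to a single interned instance. The class name is hypothetical.

```java
// Hypothetical comparison of the two ways to return an empty string.
class EmptyStringSketch {
    // Allocates a new, distinct String object on each call.
    static String viaConstructor() {
        return new String();
    }

    // Returns the interned empty-string literal; no allocation per call.
    static String viaLiteral() {
        return "";
    }
}
```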

singhpratyush

Left a few comments. Please take a look.

codecov-io · 2017-07-04T07:49:09Z

Codecov Report

Merging #1249 into development will decrease coverage by 0.02%.
The diff coverage is 0.98%.

@@               Coverage Diff                @@
##             development   #1249      +/-   ##
================================================
- Coverage           9.06%   9.04%   -0.03%     
- Complexity           393     396       +3     
================================================
  Files                199     200       +1     
  Lines              17214   17403     +189     
  Branches            3223    3252      +29     
================================================
+ Hits                1561    1574      +13     
- Misses             15346   15522     +176     
  Partials             307     307
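
The coverage percentages in the table above follow from hits divided by lines. A small sketch of that arithmetic, using the Hits and Lines values from the report (the helper class is hypothetical, and how Codecov rounds for display is not assumed here):

```java
// Hypothetical helper: line coverage as a percentage.
class CoverageSketch {
    static double percent(long hits, long lines) {
        return 100.0 * hits / lines;
    }

    public static void main(String[] args) {
        double before = percent(1561, 17214); // development
        double after = percent(1574, 17403);  // this PR
        System.out.printf("before=%.4f%% after=%.4f%% delta=%.4f%%%n",
                before, after, after - before);
    }
}
```

The 13 extra hits do not offset the 189 extra lines, which is why coverage drops slightly even though the hit count rises.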

Impacted Files	Coverage Δ	Complexity Δ
src/org/loklak/susi/SusiThought.java	`15.38% <ø> (ø)`	`5 <0> (ø)`	⬇️
src/org/loklak/harvester/BaseScraper.java	`0% <0%> (ø)`	`0 <0> (ø)`	⬇️
src/org/loklak/api/search/QuoraProfileScraper.java	`0% <0%> (ø)`	`0 <0> (ø)`	⬇️
src/org/loklak/objects/AbstractObjectEntry.java	`5.88% <0%> (ø)`	`5 <1> (ø)`	⬇️
src/org/loklak/api/search/ConsoleService.java	`0% <0%> (ø)`	`0 <0> (ø)`	⬇️
src/org/loklak/harvester/Post.java	`58.82% <0%> (+58.82%)`	`2 <0> (+2)`	⬆️
src/org/loklak/objects/Timeline2.java	`0% <0%> (ø)`	`0 <0> (?)`
src/org/loklak/objects/MessageEntry.java	`24.93% <22.22%> (-0.07%)`	`23 <0> (ø)`
src/org/json/JSONObject.java	`22.87% <0%> (+0.34%)`	`57% <0%> (+1%)`	⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3aef8be...24911f1.

singhpratyush

I am approving this as this is required to introduce other components and other issues can be taken care of as we introduce them.

Further, the idea of writing to the index needs to be thought through carefully so that the plan of "indexing everything" works.

* deploy button info for docker #1001 * Fixes #1045 : Replace the image logo in navigation bar with a text Fixes #1045 : Replace the image logo in navigation bar with a text * Fixes #1048: fix execution of method without query string * fix for latest twitter html change * add docker status badge This is related to issue #1049 * update documentation path Problem: The documentation has moved. The links in the README are outdated Solution: insert the containing folder into the path of all links to docs * moved dockerfile to root folder #1049 * changed travis build to new location * updated Dockerfile path for compose This is for issue #1049. The pull request #1050 is a precondition for this to make sense. Problem: the Dockerfile was moved to / Solution: adapt the path * get aggregations also with fresh requests from twitter with source=all * Fixes #1060: Increase default Xmx value * Fixed #1059 - Remove and Ignore .DS_Store * Fixes #1067 - Tweet URL in README is broken * corrected heading "Where do I find the java?" ->"Where do I find the Java documentation?" 
* Using the note directive of sphinx See #1042 * README.md upd, useful links added * Fixes #1033, loklak_server README.md upd, links updated with link syntax * Move documentation site The documentation site is now moved to https://github.com/loklak/dev.loklak.org Closes #1014 * fix username emoji in tweet * Fix unused imports in python files(codacy issue) Related to #1070 * removed .DS_Store * added .DS_Store to gitignore * Fix use of Null in scala code Related to #1070 * fixed scraper * Edited Readme * Add update trigger script for docs Closes #1003 * Creating Volume for persistence while deploying via docker, fix #1051 (#1089) * Update Dockerfile * Update Dockerfile * Update docker-compose.yml * Update docker-compose.yml * Update docker-compose.yml * Update Dockerfile * Update Dockerfile * Update Dockerfile * Update Dockerfile * Update docker-compose.yml * Update docker-compose.yml * Update docker-compose.yml * Update docker-compose.yml * updated docker build badge I changed the url so github requests a new image. The build works. 
https://hub.docker.com/r/mariobehling/loklak/builds/ * Docker: Consistent Volume Path Problem: docker-compose volume path is not the same as the dockerfile volume path Solution: Set the docker-compose volume path to the dockerfile volume path You can view the correct path in the Dockerfile: https://github.com/loklak/loklak_server/blob/7a1f0378dc40ec25eec6083e43558a62408d84e8/Dockerfile#L38 I checked in the container: ``` bash-4.3# ls /loklak_server/ bin conf gradlew settings.gradle build data html src build.gradle gradle installation ssi bash-4.3# ls / bin lib proc srv var dev loklak_server root sys etc media run tmp home mnt sbin usr ``` the data directory exists and is filled within `/loklak_server` * .travis.yml: Add keys for dev.loklak.org Closes #1091 * fix initGet * option to autodelete messages after one month from the main index * disabling feature introduced with 27272ee for issue #919 The storage of the settings file caused that the settings file was broken. It blew up to a huge file, like $ ls -l customized_config.properties -rw-r--r-- 1 loklak loklak 251650030 Apr 10 19:08 customized_config.properties This is the main cause that loklak.org was down since this feature was introduced. * Fixes #1099 : Changes the href link of the button download, install and extend * fix #1056 - document how to start contributing (#1063) * Added JS EventListener to resize dump iframe on load. Closes #1101 * Add Unit Tests to Loklak Server (#1098) * Add unit tests for TwitterScraper.java * Add data file to test JSONRandomAccessFileTest.java * set up unit tests build in loklak Server * fix changes requested and codacy issues * fixes scrollbar event * at the twitter scraper now use more readable version of assert, also fix bug with parse long in youtube scraper(fails on Long.parse method, because spaces are not removed), add unit test for youtube scrapper. 
* fix bug with youtube scrapper and add unit test for scraper * Fixes #1103: Changed the URLs to the correct ones (#1104) * Fixes #1103: Changed the URLs to the correct ones * Fixes #1108: Fixed the typos in documentation * fix and modify the GithubProfileScraper.java * fixes #961: add query in KaizenHarverster's queue to get older Tweets In case if the current timeline's query already has an until statement, replace it's date part with the oldest one. Also add DateFormat object in KaizenHarverster to parse Date into String of format yyyy-MM-dd. * fix eclipse classpath for storing classes (#1097) * Fixes #1123: Adding Gemnasium Button & Fixing Docker build button * Fix Codacy issue in Timeline.java. Related #1070 Link to codacy: https://www.codacy.com/app/sudheesh1995/loklak_server/file/6470204147/issues/source?bid=3495500&fileBranchId=3495500 Description: Fields should be declared at the top of the class * Fix Codacy issue for some files in org.loklak.server.api. Related #1070 * ConsoleService.java - Fields should be declared at the top of the class - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6484902617/issues/source?bid=3495500&fileBranchId=3495500 * EventBriteCrawler.java - Make spacing consistent for conditionals * GraphServlet.java - Reduce complexity of doGet method - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6484903642/issues/source?bid=3495500&fileBranchId=3495500 * Rename Dockerfile-learnings.md to docs/Dockerfile-learnings.md * fix #1138: Correct spelling mistake in README.md (#1140) Change "descripe" to "describe" in How to Contribute section. 
* Fixes #1123: Adding Gemnasium Button & Fixing Docker build button in rst file (#1137) * Related #1070: Fix Codacy issues for files in org.loklak.api.search (#1134) * EventBriteCrawlerService.java - Use one line for each declaration - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733425/issues/source?bid=3495500&fileBranchId=3495500 * GenericScraper.java - Indentation fix - New line before EOF * GithubProfileScraper.java - Remove trailing whitespaces * MeetupsCrawlerService.java - Use one line for each declaration - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733676/issues/source?bid=3495500&fileBranchId=3495500 * SearchServlet.java - Indentation fix * SuggestServlet.java - Position literals first in String comparisons - Fields should be declared at the top of the class - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733640/issues/source?bid=3495500&fileBranchId=3495500 * WeiboUserInfo.java - Switch statements should have a default label - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733550/issues/source?bid=3495500&fileBranchId=3495500 * Fixes #1139: Changed the URL (#1141) * Fixes #1139: Changed the URL * Fixes #1139: Changed the URL * Fix "Strings must use doublequote. (quotes)" Related to #1070 * Fix #1070: Strings must use doublequote. (quotes), no-use-before-define * Fix #1070: Strings must use doublequote. (quotes), no-use-before-define * Fixes #1070:Strings must use double quotes, no-use-before-define * Related to #1070:Strings must use double quotes, no-use-before-define * Related #1058: Add Kaizen harvester usage documentation (#1145) * Fix #1070: Strings must use doublequote. (quotes), no-use-before-define (#1121) * Fix "Strings must use doublequote. (quotes)" Related to #1070 * Fix #1070: Strings must use doublequote. (quotes), no-use-before-define * Fix #1070: Strings must use doublequote. 
(quotes), no-use-before-define * fix #1130: Make retries and back off parameter for backend push configurable (#1131) These variables can be set from config.properties by changing/defining caretaker.backendpush.retries and caretaker.backendpush.backoff respectively. * Fixes part of #1132: Add unit test to check TwitterScraper output (#1133) * convert markdown file to rst (#1142) * Merged development fixed conflict. * Improve code quality for org.loklak.geo.* * Related #1070: Improve code quality for org.loklak.api.admin.* (#1149) * Related to #1070: Improve code quality for org.loklak.Crawler.java * fix related to #1152: code refractoring for logging (#1153) * fix related to #1133: fix access specifiers (#1151) * fixes #1161: Add GCloud Kubernetes deployment document for loklak (#1162) * fixes #1146: Check for TwitterFactory before getting instance (#1147) * Related #1070: Fix Codacy issues for org.loklak.api.amazon.* (#1163) * fix #1143: Fix NumberException in YoutubeScraper (#1157) * Installation and Start on a user specified port (#1159) Solves issue: #925 * Fixes #1165: Fixed the QuoraProfileScraper and displaying profileImage * Related to #1112:Add filter for images, videos (#1164) * Related #1156: Make harvesting decision biased for Kaizen (#1158) A probability is chosen as queuries.size() / QUERIES_LIMIT, which is compared to a randomly chosen target probability and decision is taken accordingly. In case of no limit on the queue size, probability to harvest is set to 0.5. * Fixes #1167 GithubScraperService able to scrape user specific data (#1168) Fixes issue #1167.githubprofilescraper service now displays starred_url, number of starred repos,followers_url, number of followers, following_url, number of people following for a particuler user. * fixes #1114 Improve URL shortening service * Include all 30X HTTP response code while checking for redirect. 
* Use POST requests as fallback for GET requests - There are many cases (mostly https?://fb.me/*) when GET requests give status 400: Bad Request, while POST request works fine. The patch will allow to make an attemt for POST request for such cases and fetch the result. * Try to fetch URL from <meta/> tag in response body in case of non redirect status code. * Check the validity of URL shortening only once, and not for each intermediate URL. * Displays proper url to open loklak_server Solves issue: #1172 Displays proper localhost url in which loklak_server is running after the execution of bin/start.sh or bin/installation.sh with a "p" flag. Earlier the localhost url only displayed port 9000 at the end in case of bin/start.sh and concatenated the running port with 9000 in case of bin/stop.sh.Ex: http://localhost:9000 # bin/start.sh, actual port 8888 http://localhost:90008888 #bin/installation.sh, actual port 8888 * fixes #1177 - Added tests for WordpressCrawlerService.java fixes issue #1177. Added tests for WordpressCrawlerService.java and also removed the leading 'Author' from the author field in json output. * fix #1176: Fetch debug flag from config file Change configurations for TwitterScraper and ClientConnection * fixes #1184 - Instagram Profile Scraper is now working fixes issue #1184. Instagram scraper is now returning data. * fix #1179: Use java.net.URL to build relative URL in ClientConnection (#1183) * fixes #1070: Add test for URL unshortening (#1173) * fixes #1169 - Added test for Github profile scraper (#1185) fixes issue #1169, Added tests for GithubProfileScraper service. 
* Improve code quality for some files in org.loklak.api.cms and add checkstyle as gradle task (#1187) * Related #1070: Improve code quality for some files in org.loklak.api.cms Fixes are done using checkstyle with google_check.xml config and 4 space indentation level * Add checkstyle check as gradle task * Fixes #1191: NullPointerException in CareTaker.java (#1192) * Auto-generate docs in dev.loklak.org repository (#1195) * Fix #1171: Extract video URLs from IFrame (#1193) Videos are added as an IFrame for Twitter. To fetch the video URLs, we first fetch the IFrame page and then check for the video format. If it is mp4, we're done. If it is m3u8, we need to fetch the m3u8 link in order to get actual videos. Mostly, these videos are of .ts format. Also add org.unbescape as gradle dependency to unescape string in iframe. * FIx #1201: Break down KaizenHarvester into simpler pieces (#1203) Introduce KaizenQuery class to support different methods to store queries that Kaizen needs to process * Fix #1208: Add .editorconfig (#1209) * Fixes #1204 Add subtree if not already added (#1207) * Fix #1205: Extract complete video URLs for Tweets (#1206) This implementation mimics the video playback flow of mobile react app of Twitter. 1. Extract BEARER_TOKEN holding script's URL. 2. Extract guest session token. 3. Extract BEARER_TOKEN from URL in 1. 4. Make Twitter API call with the parameters. * fixes #1196 - Enhanced Quora profile scraper #1199 (#1200) Fixes issue #1196 The scraper now provides more information like university of user, location where user works, topics he knows, number of followers, number of questions, number of edits, number of blogs etc. * Fix #1188: Use unbescape to unescape HTML in html2utf8 (#1194) Also improve whitespace cleaning in the method. Move old implementation to html2utf8Custom. 
* Fixes #1097: Restore access specifiers in TwitterScraper.java (#1198) * Fix indentation (#1211) * Fix #1212: fix checkstyle errors(except missing javadoc) (#1218) * Fixes #1215 fix syntax error in the script (#1217) * Fix #1213: Include videos for testing TwitterScraper (#1221) * Fix 1216: Revert "Installation and Start on a user specified port (#1159)" (#1227) This reverts commit 1e0bcd5. Conflicts (resolved): bin/installation.sh bin/start.sh * Fixes #1202: Modify loggers in Loklak Server for testing (#1222) * Fixes #1219: Add UTC time in TimeAndDateService (#1220) * Fixes #1112: Add image, video filter constraints for cache (#1190) * Fixes #1236: Update Docs for get parameter (#1237) * Fixes #1226 Build error currently showing (#1228) * Fixes 1215 Fix relative link * Update git to work with subtree * Adding echo statements * Fix #1239: Correct flag values in config.properties * Fix #1238: Add PriorityQueue harvesting strategy (#1240) Also add score related to each Tweet based on retweet and favourite count. 
* Fix #1251: Correct test case for RedirectUnshortener (#1253) http://t.co/E3w7s2qdBT now points to http://www.mostviralfeed.com/what-lady-gaga-actually-looks-like instead of http://mostviralfeed.com/what-lady-gaga-actually-looks-like * Fix #1247: Add function to collect stats about all classes for a classifier (#1248) * Fix #1256: Add classifier.json endpoint to serve aggregated data (#1257) * refactoring to have the same naming as in susi_server * Fixes #1261: RedirectUnshortener link fix (#1262) * Fixes #1229, #1235, Related #1230: Setup of testable version (#1250) 1) setup post and basescraper 2) Setup quoraprofilescraper with basescraper and post * Fix #1259: Add function for time sensitive aggregation (#1260) * Fix #1271: Correct redirect link in test (#1272) * Fix #1266: Allow time based aggregation in /api/classifier.json (#1267) * Fix #1278: Correct typo in kaizen.md (#1279) * enhanced elasticsearch mapping * eclipse classpath to use same as gradle * removed unused imports * Fix #1268: Add function for aggregation based on country codes (#1270) Following operations are now possible - * All time aggregation for all countries * Time sensitive aggregation for all countries * All previous aggregations for selected countries * Fix #1273: Add Jacoco to provide coverage report in XML format (#1274) * Fixes 1284: Improve test cases for URL unshortener (#1285) * Setup post and basescraper with QuoraProfileScraper (#1249) * Setup of testable version setup post and basescraper * Related #1230, 1231, 1244: integrate Timeline2 with quorascraper * Configure ssh agent before push

* Fixes loklak#1167 GithubScraperService able to scrape user specific data (loklak#1168) Fixes issue loklak#1167.githubprofilescraper service now displays starred_url, number of starred repos,followers_url, number of followers, following_url, number of people following for a particuler user. * fixes loklak#1114 Improve URL shortening service * Include all 30X HTTP response code while checking for redirect. * Use POST requests as fallback for GET requests - There are many cases (mostly https?://fb.me/*) when GET requests give status 400: Bad Request, while POST request works fine. The patch will allow to make an attemt for POST request for such cases and fetch the result. * Try to fetch URL from <meta/> tag in response body in case of non redirect status code. * Check the validity of URL shortening only once, and not for each intermediate URL. * Displays proper url to open loklak_server Solves issue: loklak#1172 Displays proper localhost url in which loklak_server is running after the execution of bin/start.sh or bin/installation.sh with a "p" flag. Earlier the localhost url only displayed port 9000 at the end in case of bin/start.sh and concatenated the running port with 9000 in case of bin/stop.sh.Ex: http://localhost:9000 # bin/start.sh, actual port 8888 http://localhost:90008888 #bin/installation.sh, actual port 8888 * fixes loklak#1177 - Added tests for WordpressCrawlerService.java fixes issue loklak#1177. Added tests for WordpressCrawlerService.java and also removed the leading 'Author' from the author field in json output. * fix loklak#1176: Fetch debug flag from config file Change configurations for TwitterScraper and ClientConnection * fixes loklak#1184 - Instagram Profile Scraper is now working fixes issue loklak#1184. Instagram scraper is now returning data. 
* fix loklak#1179: Use java.net.URL to build relative URL in ClientConnection (loklak#1183) * fixes loklak#1070: Add test for URL unshortening (loklak#1173) * fixes loklak#1169 - Added test for Github profile scraper (loklak#1185) fixes issue loklak#1169, Added tests for GithubProfileScraper service. * Improve code quality for some files in org.loklak.api.cms and add checkstyle as gradle task (loklak#1187) * Related loklak#1070: Improve code quality for some files in org.loklak.api.cms Fixes are done using checkstyle with google_check.xml config and 4 space indentation level * Add checkstyle check as gradle task * Fixes loklak#1191: NullPointerException in CareTaker.java (loklak#1192) * Auto-generate docs in dev.loklak.org repository (loklak#1195) * Fix loklak#1171: Extract video URLs from IFrame (loklak#1193) Videos are added as an IFrame for Twitter. To fetch the video URLs, we first fetch the IFrame page and then check for the video format. If it is mp4, we're done. If it is m3u8, we need to fetch the m3u8 link in order to get actual videos. Mostly, these videos are of .ts format. Also add org.unbescape as gradle dependency to unescape string in iframe. * FIx loklak#1201: Break down KaizenHarvester into simpler pieces (loklak#1203) Introduce KaizenQuery class to support different methods to store queries that Kaizen needs to process * Fix loklak#1208: Add .editorconfig (loklak#1209) * Fixes loklak#1204 Add subtree if not already added (loklak#1207) * Fix loklak#1205: Extract complete video URLs for Tweets (loklak#1206) This implementation mimics the video playback flow of mobile react app of Twitter. 1. Extract BEARER_TOKEN holding script's URL. 2. Extract guest session token. 3. Extract BEARER_TOKEN from URL in 1. 4. Make Twitter API call with the parameters. 
* fixes loklak#1196 - Enhanced Quora profile scraper loklak#1199 (loklak#1200) Fixes issue loklak#1196 The scraper now provides more information like university of user, location where user works, topics he knows, number of followers, number of questions, number of edits, number of blogs etc. * Fix loklak#1188: Use unbescape to unescape HTML in html2utf8 (loklak#1194) Also improve whitespace cleaning in the method. Move old implementation to html2utf8Custom. * Fixes loklak#1097: Restore access specifiers in TwitterScraper.java (loklak#1198) * Fix indentation (loklak#1211) * Fix loklak#1212: fix checkstyle errors(except missing javadoc) (loklak#1218) * Fixes loklak#1215 fix syntax error in the script (loklak#1217) * Fix loklak#1213: Include videos for testing TwitterScraper (loklak#1221) * Fix 1216: Revert "Installation and Start on a user specified port (loklak#1159)" (loklak#1227) This reverts commit 1e0bcd5. Conflicts (resolved): bin/installation.sh bin/start.sh * Fixes loklak#1202: Modify loggers in Loklak Server for testing (loklak#1222) * Fixes loklak#1219: Add UTC time in TimeAndDateService (loklak#1220) * Fixes loklak#1112: Add image, video filter constraints for cache (loklak#1190) * Fixes loklak#1236: Update Docs for get parameter (loklak#1237) * Fixes loklak#1226 Build error currently showing (loklak#1228) * Fixes 1215 Fix relative link * Update git to work with subtree * Adding echo statements * Fix loklak#1239: Correct flag values in config.properties * Fix loklak#1238: Add PriorityQueue harvesting strategy (loklak#1240) Also add score related to each Tweet based on retweet and favourite count. 
* Fix loklak#1251: Correct test case for RedirectUnshortener (loklak#1253) http://t.co/E3w7s2qdBT now points to http://www.mostviralfeed.com/what-lady-gaga-actually-looks-like instead of http://mostviralfeed.com/what-lady-gaga-actually-looks-like * Fix loklak#1247: Add function to collect stats about all classes for a classifier (loklak#1248) * Fix loklak#1256: Add classifier.json endpoint to serve aggregated data (loklak#1257) * refactoring to have the same naming as in susi_server * Fixes loklak#1261: RedirectUnshortener link fix (loklak#1262) * Fixes loklak#1229, loklak#1235, Related loklak#1230: Setup of testable version (loklak#1250) 1) setup post and basescraper 2) Setup quoraprofilescraper with basescraper and post * Fix loklak#1259: Add function for time sensitive aggregation (loklak#1260) * Fix loklak#1271: Correct redirect link in test (loklak#1272) * Fix loklak#1266: Allow time based aggregation in /api/classifier.json (loklak#1267) * Fix loklak#1278: Correct typo in kaizen.md (loklak#1279) * enhanced elasticsearch mapping * eclipse classpath to use same as gradle * removed unused imports * Fix loklak#1268: Add function for aggregation based on country codes (loklak#1270) Following operations are now possible - * All time aggregation for all countries * Time sensitive aggregation for all countries * All previous aggregations for selected countries * Fix loklak#1273: Add Jacoco to provide coverage report in XML format (loklak#1274) * Fixes 1284: Improve test cases for URL unshortener (loklak#1285) * Setup post and basescraper with QuoraProfileScraper (loklak#1249) * Setup of testable version setup post and basescraper * Related loklak#1230, 1231, 1244: integrate Timeline2 with quorascraper * Configure ssh agent before push

vibhcool changed the title ~~Setup post and basescraper with QuoraProfileScraper~~ [WIP] : Setup post and basescraper with QuoraProfileScraper Jun 15, 2017

vibhcool changed the title ~~[WIP] : Setup post and basescraper with QuoraProfileScraper~~ Setup post and basescraper with QuoraProfileScraper Jun 15, 2017

vibhcool mentioned this pull request Jun 15, 2017

Add Post and BaseScraper to QuoraProfileScraper #1250

Merged

5 tasks

vibhcool force-pushed the 1231 branch from d575abc to b1fd864 Compare June 15, 2017 19:23

vibhcool mentioned this pull request Jun 15, 2017

Add Scraper for questions of QuoraScraper #1252

Closed

vibhcool force-pushed the 1231 branch from b1fd864 to bda4461 Compare June 16, 2017 04:47

kavithaenair reviewed Jun 16, 2017

vibhcool mentioned this pull request Jun 16, 2017

Fixes #1252: Add Scraper for questions of QuoraScraper #1255

Merged

5 tasks

sudheesh001 requested changes Jun 17, 2017

hemantjadon reviewed Jun 19, 2017

vibhcool force-pushed the 1231 branch 7 times, most recently from 3ac2401 to bcb0243 Compare June 22, 2017 17:53

vibhcool force-pushed the 1231 branch from bcb0243 to 4c4ea1f Compare June 22, 2017 18:15

Achint08 approved these changes Jun 22, 2017

singhpratyush reviewed Jun 23, 2017

Achint08 reviewed Jun 23, 2017

vibhcool mentioned this pull request Jun 23, 2017

[WIP] Modify TwitterTweet object to use PostTimeline iterator #1264

Closed

5 tasks

singhpratyush reviewed Jun 28, 2017

vibhcool mentioned this pull request Jun 28, 2017

Fixes #1276: Output data from TwitterTweet object as Post Object #1277

Merged

5 tasks

vibhcool force-pushed the 1231 branch from 4c4ea1f to 2e25a7b Compare June 28, 2017 19:27

hemantjadon approved these changes Jun 30, 2017

kavithaenair approved these changes Jun 30, 2017

vibhcool force-pushed the 1231 branch from 2e25a7b to 8d02da4 Compare July 1, 2017 12:51

Setup of testable version

e2440e3

setup post and basescraper

vibhcool force-pushed the 1231 branch from 027930e to 96b1615 Compare July 2, 2017 22:54

singhpratyush reviewed Jul 4, 2017

vibhcool force-pushed the 1231 branch from 96b1615 to f10d3ee Compare July 4, 2017 07:49

vibhcool force-pushed the 1231 branch 2 times, most recently from 4ad6c84 to 28efecf Compare July 4, 2017 08:24

Related loklak#1230, 1231, 1244: integrate Timeline2 with quorascraper

24911f1

vibhcool force-pushed the 1231 branch from 28efecf to 24911f1 Compare July 4, 2017 08:29

singhpratyush approved these changes Jul 4, 2017

mariobehling merged commit 02a5920 into loklak:development Jul 4, 2017

vibhcool commented Jun 15, 2017 • edited

vibhcool commented Jun 16, 2017

kavithaenair left a comment

sudheesh001 commented Jun 17, 2017

hemantjadon left a comment

vibhcool commented Jun 20, 2017

vibhcool commented Jun 22, 2017 • edited

Achint08 left a comment

singhpratyush Jun 28, 2017 • edited

Achint08 Jun 23, 2017 • edited

vibhcool Jun 28, 2017 • edited

singhpratyush Jun 29, 2017 • edited

singhpratyush commented Jun 28, 2017

vibhcool commented Jun 28, 2017

SKrPl commented Jun 29, 2017

vibhcool commented Jun 29, 2017

kavithaenair left a comment

sudheesh001 commented Jun 30, 2017

singhpratyush left a comment

codecov-io commented Jul 4, 2017 • edited

Codecov Report

singhpratyush left a comment
