Extract video URLs from IFrame on Twitter search page #1193

Merged
merged 1 commit into from May 30, 2017

singhpratyush commented May 26, 2017

Short description

Fixes #1171.

Twitter embeds videos in an IFrame. To get the video URLs, we first fetch the IFrame page and then check the video format. If it is mp4, we're done. If it is m3u8, we need to fetch the m3u8 link to get the actual video URLs. Mostly, these videos are in .ts format.

Reference - youtube-dl's Twitter extractor.

Results for all available resolutions are shown here.
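
The m3u8 handling described above can be sketched roughly as follows. This is a minimal Python illustration, not the code in this PR: `playlist_uris`, `resolve_video_urls`, and the `fetch` callable are hypothetical names, and it assumes the m3u8 link points to a master playlist whose entries are per-resolution playlists listing .ts segments.

```python
from urllib.parse import urljoin


def playlist_uris(playlist_text, base_url):
    """Return the absolute URIs listed in an m3u8 playlist."""
    uris = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments; the rest are URIs.
        if line and not line.startswith("#"):
            uris.append(urljoin(base_url, line))
    return uris


def is_m3u8(url):
    """Check the file extension, ignoring any query string."""
    return url.split("?")[0].endswith(".m3u8")


def resolve_video_urls(video_url, fetch):
    """Resolve a video URL found in the IFrame page to direct media URLs.

    `fetch` is a hypothetical callable that maps a URL to its text body.
    """
    if not is_m3u8(video_url):
        # For example mp4: the URL is already a direct link.
        return [video_url]
    media_urls = []
    # Assumed master playlist: one entry per available resolution.
    for variant in playlist_uris(fetch(video_url), video_url):
        if is_m3u8(variant):
            # Per-resolution playlist: collect its segment URLs.
            media_urls.extend(playlist_uris(fetch(variant), variant))
        else:
            media_urls.append(variant)
    return media_urls
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client, so the playlist logic can be exercised with canned playlist text.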

Example -

$ http "http://127.0.0.1:9000/api/search.json?q=video&source=twitter"
HTTP/1.1 200 OK
Content-Type: application/json;charset=utf-8
Date: Sat, 27 May 2017 02:49:12 GMT
Expires: Sat, 27 May 2017 02:49:14 GMT
Last-Modified: Sat, 27 May 2017 02:49:12 GMT
Server: Jetty(9.3.z-SNAPSHOT)
Set-Cookie: JSESSIONID=1d919deht2k0a18ds388pomut3;Path=/
Transfer-Encoding: chunked
X-Robots-Tag: noindex,noarchive,nofollow,nosnippet

{
    "readme_0": "THIS JSON IS THE RESULT OF YOUR SEARCH QUERY - THERE IS NO WEB PAGE WHICH SHOWS THE RESULT!",
    "readme_1": "loklak.org is the framework for a message search system, not the portal, read: http://loklak.org/about.html#notasearchportal",
    "readme_2": "This is supposed to be the back-end of a search portal. For the api, see http://loklak.org/api.html",
    "readme_3": "Parameters q=(query), source=(cache|backend|twitter|all), callback=p for jsonp, maximumRecords=(message count), minified=(true|false)",
    "search_metadata": {
        "cache_hits": 0,
        "client": "127.0.0.1",
        "count": "20",
        "count_backend": 0,
        "count_twitter_all": 0,
        "count_twitter_new": 20,
        "filter": "",
        "hits": 20,
        "maximumRecords": "20",
        "period": 3000,
        "query": "video",
        "scraperInfo": "local",
        "servicereduction": "false",
        "startRecord": "1",
        "time": 17549
    },
    "statuses": [
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "youtu.be"
            ],
            "hosts_count": 1,
            "id_str": "868297956797153281",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/stevewilliam490/status/868297956797153281",
            "links": [
                "https://youtu.be/p5B6ASaIBSo"
            ],
            "links_count": 1,
            "location_mark": [
                -87.46889219857262,
                15.774493737645102
            ],
            "location_point": [
                -87.46730816898277,
                15.77425000992335
            ],
            "location_radius": 0,
            "location_source": "ANNOTATION",
            "mentions": [
                "YouTube"
            ],
            "mentions_count": 1,
            "parent": "",
            "place_context": "ABOUT",
            "place_country": "Honduras",
            "place_country_center": [
                -44.59999854424969,
                8.22500036015407
            ],
            "place_country_code": "HN",
            "place_id": "",
            "place_name": "Tela",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "stevewilliam490",
            "source_type": "TWITTER",
            "text": "3 HORAS DE SOM DE CHUVA E TROVOADA PARA DORMIR E RELAXAR TELA PRETA 3 HO... https://youtu.be/p5B6ASaIBSo via @YouTube",
            "text_length": 117,
            "timestamp": "2017-05-27T02:49:16.740Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:16.740Z",
                "appearance_latest": "2017-05-27T02:49:16.740Z",
                "name": "Aylton carvalho",
                "profile_image_url_https": "https://abs.twimg.com/sticky/default_profile_images/default_profile_bigger.png",
                "screen_name": "stevewilliam490",
                "user_id": "2419650132"
            },
            "videos": [
                "https://youtu.be/p5B6ASaIBSo"
            ],
            "videos_count": 1,
            "without_l_len": 88,
            "without_lu_len": 79,
            "without_luh_len": 79
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "www.youtube.com"
            ],
            "hosts_count": 1,
            "id_str": "868297956780163072",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/japan_3150/status/868297956780163072",
            "links": [
                "https://www.youtube.com/watch?v=k6ce3UXO7o4"
            ],
            "links_count": 1,
            "mentions": [],
            "mentions_count": 0,
            "parent": "",
            "place_context": "ABOUT",
            "place_id": "",
            "place_name": "",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "japan_3150",
            "source_type": "TWITTER",
            "text": "【定期】沖縄左翼が一般市民を襲撃! 沖縄県警は、なぜ逮捕しない???ヘイトスピーチや軽犯罪法違反は日常茶飯事だろうが??? https://www.youtube.com/watch?v=k6ce3UXO7o4",
            "text_length": 105,
            "timestamp": "2017-05-27T02:49:16.740Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:16.740Z",
                "appearance_latest": "2017-05-27T02:49:16.740Z",
                "name": "河山源海@覚醒の鐘の音",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/815431708308713472/cMpQsqOn_bigger.jpg",
                "screen_name": "japan_3150",
                "user_id": "231692016"
            },
            "videos": [
                "https://www.youtube.com/watch?v=k6ce3UXO7o4"
            ],
            "videos_count": 1,
            "without_l_len": 61,
            "without_lu_len": 61,
            "without_luh_len": 61
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "classifier_language": "english",
            "classifier_language_probability": 9.410032648926592e-12,
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "youtu.be"
            ],
            "hosts_count": 1,
            "id_str": "868297956767780868",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/filmcourage/status/868297956767780868",
            "links": [
                "http://youtu.be/RtefQFZBUmA?a"
            ],
            "links_count": 1,
            "location_mark": [
                -81.88741984127522,
                31.60557210726796
            ],
            "location_point": [
                -81.88633729791565,
                31.607849007289786
            ],
            "location_radius": 0,
            "location_source": "ANNOTATION",
            "mentions": [
                "YouTube"
            ],
            "mentions_count": 1,
            "parent": "",
            "place_context": "ABOUT",
            "place_country": "United States",
            "place_country_center": [
                -83.27110293007973,
                35.6452903550329
            ],
            "place_country_code": "US",
            "place_id": "",
            "place_name": "Jesup",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "filmcourage",
            "source_type": "TWITTER",
            "text": "Can A Bad Writer Become Great? by Lee Jessup: http://youtu.be/RtefQFZBUmA?a via @YouTube",
            "text_length": 88,
            "timestamp": "2017-05-27T02:49:16.742Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:16.742Z",
                "appearance_latest": "2017-05-27T02:49:16.742Z",
                "name": "Film Courage",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/757464702662217728/UdUFMbGD_bigger.jpg",
                "screen_name": "filmcourage",
                "user_id": "45083495"
            },
            "videos": [
                "http://youtu.be/RtefQFZBUmA?a"
            ],
            "videos_count": 1,
            "without_l_len": 58,
            "without_lu_len": 49,
            "without_luh_len": 49
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "classifier_emotion": "fear",
            "classifier_emotion_probability": 4.692500112923881e-07,
            "classifier_language": "english",
            "classifier_language_probability": 9.73710712059983e-09,
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [
                "askethanandgrayson"
            ],
            "hashtags_count": 1,
            "hosts": [],
            "hosts_count": 0,
            "id_str": "868297956713263105",
            "images": [
                "https://abs.twimg.com/emoji/v2/72x72/1f940.png"
            ],
            "images_count": 1,
            "link": "https://twitter.com/adorbsdolan/status/868297956713263105",
            "links": [],
            "links_count": 0,
            "mentions": [],
            "mentions_count": 0,
            "parent": "",
            "place_context": "ABOUT",
            "place_id": "",
            "place_name": "",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "adorbsdolan",
            "source_type": "TWITTER",
            "text": "show us the 69th video or picture in your memories #AskEthanAndGrayson",
            "text_length": 70,
            "timestamp": "2017-05-27T02:49:16.744Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:16.744Z",
                "appearance_latest": "2017-05-27T02:49:16.744Z",
                "name": "izzy 🥀",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/867493294049198082/rMDUR0_e_bigger.jpg",
                "screen_name": "adorbsdolan",
                "user_id": "3700315941"
            },
            "videos": [],
            "videos_count": 0,
            "without_l_len": 70,
            "without_lu_len": 70,
            "without_luh_len": 50
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "youtu.be"
            ],
            "hosts_count": 1,
            "id_str": "868297956625088512",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/marvynmesquita/status/868297956625088512",
            "links": [
                "http://youtu.be/CVElCdD4rtw?aWind"
            ],
            "links_count": 1,
            "mentions": [
                "YouTube"
            ],
            "mentions_count": 1,
            "parent": "",
            "place_context": "ABOUT",
            "place_id": "",
            "place_name": "",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "marvynmesquita",
            "source_type": "TWITTER",
            "text": "Gostei de um vídeo @YouTube http://youtu.be/CVElCdD4rtw?aWind River (2017) - Trailer Legendado 🎬",
            "text_length": 97,
            "timestamp": "2017-05-27T02:49:16.749Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:16.749Z",
                "appearance_latest": "2017-05-27T02:49:16.749Z",
                "name": "Trojanº¹",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/473198381343731712/wEzjN4QR_bigger.jpeg",
                "screen_name": "marvynmesquita",
                "user_id": "69683575"
            },
            "videos": [
                "http://youtu.be/CVElCdD4rtw?aWind"
            ],
            "videos_count": 1,
            "without_l_len": 63,
            "without_lu_len": 54,
            "without_luh_len": 54
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "classifier_language": "dutch",
            "classifier_language_probability": 0.0014894634950906038,
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "pic.twitter.com"
            ],
            "hosts_count": 1,
            "id_str": "868297956583063552",
            "images": [
                "https://pbs.twimg.com/ext_tw_video_thumb/868297907140567040/pu/img/469c3UloHz0MHP5P.jpg",
                "https://pic.twitter.com/H37BWKClTR"
            ],
            "images_count": 2,
            "link": "https://twitter.com/Byulyoda/status/868297956583063552",
            "links": [
                "https://pic.twitter.com/H37BWKClTR"
            ],
            "links_count": 1,
            "mentions": [],
            "mentions_count": 0,
            "parent": "",
            "place_context": "ABOUT",
            "place_id": "",
            "place_name": "",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "Byulyoda",
            "source_type": "TWITTER",
            "text": "나는 목소리로 사람을 죽이지 https://pic.twitter.com/H37BWKClTR",
            "text_length": 50,
            "timestamp": "2017-05-27T02:49:23.537Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:23.537Z",
                "appearance_latest": "2017-05-27T02:49:23.537Z",
                "name": "*:・゚✧꧁༺별༒요다༻꧂✧゚・: *",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/866864213817470976/l2ITcMAg_bigger.jpg",
                "screen_name": "Byulyoda",
                "user_id": "3278079985"
            },
            "videos": [
                "https://video.twimg.com/ext_tw_video/868297907140567040/pu/vid/0/3000/240x240/BmGuMlZYNKoxigfh.ts",
                "https://video.twimg.com/ext_tw_video/868297907140567040/pu/vid/3000/7487/240x240/j7r6rnYmbxVJyi5G.ts",
                "https://video.twimg.com/ext_tw_video/868297907140567040/pu/vid/0/3000/480x480/JyB2Mwu7fRBmQcpw.ts",
                "https://video.twimg.com/ext_tw_video/868297907140567040/pu/vid/3000/7487/480x480/n0hJaDYEqB9u5hxm.ts",
                "https://video.twimg.com/ext_tw_video/868297907140567040/pu/vid/0/3000/720x720/M4u0HOZvdbZ7_S3h.ts",
                "https://video.twimg.com/ext_tw_video/868297907140567040/pu/vid/3000/7487/720x720/AopLVgfeFVWMGedw.ts"
            ],
            "videos_count": 6,
            "without_l_len": 15,
            "without_lu_len": 15,
            "without_luh_len": 15
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "classifier_language": "spanish",
            "classifier_language_probability": 1.1286945664323866e-05,
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [],
            "hosts_count": 0,
            "id_str": "868297956478341121",
            "images": [
                "https://abs.twimg.com/emoji/v2/72x72/1f338.png"
            ],
            "images_count": 1,
            "link": "https://twitter.com/GabbyFonsk/status/868297956478341121",
            "links": [],
            "links_count": 0,
            "location_mark": [
                110.83048088938347,
                -7.557989795774432
            ],
            "location_point": [
                110.83167266605034,
                -7.556109944151785
            ],
            "location_radius": 0,
            "location_source": "ANNOTATION",
            "mentions": [],
            "mentions_count": 0,
            "parent": "",
            "place_context": "ABOUT",
            "place_country": "Indonesia",
            "place_country_center": [
                -70.35906213352413,
                8.030259873732106
            ],
            "place_country_code": "ID",
            "place_id": "",
            "place_name": "Surakarta",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "GabbyFonsk",
            "source_type": "TWITTER",
            "text": "😂😂😂 todo mundo vio el video, solo 👀👀",
            "text_length": 41,
            "timestamp": "2017-05-27T02:49:23.538Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:23.538Z",
                "appearance_latest": "2017-05-27T02:49:23.538Z",
                "name": "Gaby 🌸",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/866485455906361344/dT-fFj5i_bigger.jpg",
                "screen_name": "GabbyFonsk",
                "user_id": "802868744435486720"
            },
            "videos": [],
            "videos_count": 0,
            "without_l_len": 41,
            "without_lu_len": 41,
            "without_luh_len": 41
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "www.youtube.com"
            ],
            "hosts_count": 1,
            "id_str": "868297956398698501",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/Miguel2k/status/868297956398698501",
            "links": [
                "https://www.youtube.com/watch?v=Q5DTBGU8GOE"
            ],
            "links_count": 1,
            "mentions": [],
            "mentions_count": 0,
            "parent": "",
            "place_context": "ABOUT",
            "place_id": "",
            "place_name": "",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "Miguel2k",
            "source_type": "TWITTER",
            "text": "😬 https://www.youtube.com/watch?v=Q5DTBGU8GOE",
            "text_length": 46,
            "timestamp": "2017-05-27T02:49:23.540Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:23.540Z",
                "appearance_latest": "2017-05-27T02:49:23.540Z",
                "name": "Miguel López Ley",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/1520892057/25052009185_bigger.jpg",
                "screen_name": "Miguel2k",
                "user_id": "108360478"
            },
            "videos": [
                "https://www.youtube.com/watch?v=Q5DTBGU8GOE"
            ],
            "videos_count": 1,
            "without_l_len": 2,
            "without_lu_len": 2,
            "without_luh_len": 2
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "www.youtube.com"
            ],
            "hosts_count": 1,
            "id_str": "868297956390182912",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/dhitka/status/868297956390182912",
            "links": [
                "https://www.youtube.com/watch?v=3KIEQuCdNZ8&sns=fb"
            ],
            "links_count": 1,
            "mentions": [],
            "mentions_count": 0,
            "parent": "",
            "place_context": "ABOUT",
            "place_id": "",
            "place_name": "",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "dhitka",
            "source_type": "TWITTER",
            "text": "https://www.youtube.com/watch?v=3KIEQuCdNZ8&sns=fb",
            "text_length": 50,
            "timestamp": "2017-05-27T02:49:23.541Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:23.541Z",
                "appearance_latest": "2017-05-27T02:49:23.541Z",
                "name": "DHITKA PRASTAMA",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/860892369436016640/7z3tohVv_bigger.jpg",
                "screen_name": "dhitka",
                "user_id": "2196345631"
            },
            "videos": [
                "https://www.youtube.com/watch?v=3KIEQuCdNZ8&sns=fb"
            ],
            "videos_count": 1,
            "without_l_len": 0,
            "without_lu_len": 0,
            "without_luh_len": 0
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "youtu.be"
            ],
            "hosts_count": 1,
            "id_str": "868297956298031104",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/Dede_Pf/status/868297956298031104",
            "links": [
                "http://youtu.be/XlCpZyQPP4Y?aINJUSTICE"
            ],
            "links_count": 1,
            "location_mark": [
                42.87387565829874,
                41.61786284616791
            ],
            "location_point": [
                42.87223805807167,
                41.61558139213153
            ],
            "location_radius": 0,
            "location_source": "ANNOTATION",
            "mentions": [
                "YouTube"
            ],
            "mentions_count": 1,
            "parent": "",
            "place_context": "ABOUT",
            "place_country": "Georgia",
            "place_country_center": [
                -23.138334198453606,
                21.69055555560186
            ],
            "place_country_code": "GE",
            "place_id": "",
            "place_name": "Vale",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "Dede_Pf",
            "source_type": "TWITTER",
            "text": "Gostei de um vídeo @YouTube http://youtu.be/XlCpZyQPP4Y?aINJUSTICE 2 : Vale ou não a pena jogar",
            "text_length": 95,
            "timestamp": "2017-05-27T02:49:23.541Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:23.541Z",
                "appearance_latest": "2017-05-27T02:49:23.541Z",
                "name": "Adenilson Lopes",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/826670002069700609/ToM-Dqd9_bigger.jpg",
                "screen_name": "Dede_Pf",
                "user_id": "52869866"
            },
            "videos": [
                "http://youtu.be/XlCpZyQPP4Y?aINJUSTICE"
            ],
            "videos_count": 1,
            "without_l_len": 56,
            "without_lu_len": 47,
            "without_luh_len": 47
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "youtu.be"
            ],
            "hosts_count": 1,
            "id_str": "868297956255973376",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/TYkcLX6fMNKoI4f/status/868297956255973376",
            "links": [
                "https://youtu.be/PIh2xe4jnpk"
            ],
            "links_count": 1,
            "mentions": [],
            "mentions_count": 0,
            "parent": "",
            "place_context": "ABOUT",
            "place_id": "",
            "place_name": "",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "TYkcLX6fMNKoI4f",
            "source_type": "TWITTER",
            "text": "https://youtu.be/PIh2xe4jnpk",
            "text_length": 28,
            "timestamp": "2017-05-27T02:49:23.543Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:23.543Z",
                "appearance_latest": "2017-05-27T02:49:23.543Z",
                "name": "รัตนาพร อินอำนวย",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/867651819190747138/_vOZl3IH_bigger.jpg",
                "screen_name": "TYkcLX6fMNKoI4f",
                "user_id": "867649473530388480"
            },
            "videos": [
                "https://youtu.be/PIh2xe4jnpk"
            ],
            "videos_count": 1,
            "without_l_len": 0,
            "without_lu_len": 0,
            "without_luh_len": 0
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "youtu.be"
            ],
            "hosts_count": 1,
            "id_str": "868297956218335232",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/saleemmhra666/status/868297956218335232",
            "links": [
                "https://youtu.be/noMJlowiZc8"
            ],
            "links_count": 1,
            "mentions": [],
            "mentions_count": 0,
            "parent": "",
            "place_context": "ABOUT",
            "place_id": "",
            "place_name": "",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "saleemmhra666",
            "source_type": "TWITTER",
            "text": "https://youtu.be/noMJlowiZc8",
            "text_length": 28,
            "timestamp": "2017-05-27T02:49:23.543Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:23.543Z",
                "appearance_latest": "2017-05-27T02:49:23.543Z",
                "name": "Khawa",
                "profile_image_url_https": "https://abs.twimg.com/sticky/default_profile_images/default_profile_bigger.png",
                "screen_name": "saleemmhra666",
                "user_id": "868297729360973824"
            },
            "videos": [
                "https://youtu.be/noMJlowiZc8"
            ],
            "videos_count": 1,
            "without_l_len": 0,
            "without_lu_len": 0,
            "without_luh_len": 0
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "youtu.be"
            ],
            "hosts_count": 1,
            "id_str": "868297956092522496",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/JuicyMichaelhea/status/868297956092522496",
            "links": [
                "http://youtu.be/bouMQPIaU-I?aA"
            ],
            "links_count": 1,
            "location_mark": [
                -75.70829138945116,
                45.44069180801908
            ],
            "location_point": [
                -75.69812017292628,
                45.41117084464577
            ],
            "location_radius": 0,
            "location_source": "ANNOTATION",
            "mentions": [
                "YouTube"
            ],
            "mentions_count": 1,
            "parent": "",
            "place_context": "ABOUT",
            "place_country": "Canada",
            "place_country_center": [
                -69.71663669669844,
                35.23458095511168
            ],
            "place_country_code": "CA",
            "place_id": "",
            "place_name": "Ottawa",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "JuicyMichaelhea",
            "source_type": "TWITTER",
            "text": "Gostei de um vídeo @YouTube http://youtu.be/bouMQPIaU-I?aA mala é falsa - Felípe Araujo ( Cover Lorrana Veras)",
            "text_length": 110,
            "timestamp": "2017-05-27T02:49:23.544Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:23.544Z",
                "appearance_latest": "2017-05-27T02:49:23.544Z",
                "name": "JuicyMichaelheat",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/653066252630511616/vRPJtEf9_bigger.png",
                "screen_name": "JuicyMichaelhea",
                "user_id": "2613099066"
            },
            "videos": [
                "http://youtu.be/bouMQPIaU-I?aA"
            ],
            "videos_count": 1,
            "without_l_len": 79,
            "without_lu_len": 70,
            "without_luh_len": 70
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "classifier_emotion": "sadness",
            "classifier_emotion_probability": 3.5973439782566174e-09,
            "classifier_profanity": "swear",
            "classifier_profanity_probability": 1.6361487720217838e-09,
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "youtu.be"
            ],
            "hosts_count": 1,
            "id_str": "868297956042133508",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/GGAllinTV/status/868297956042133508",
            "links": [
                "http://youtu.be/Gpnk2W7U0JU?a"
            ],
            "links_count": 1,
            "location_mark": [
                26.04762219626565,
                57.78228872593425
            ],
            "location_point": [
                26.04730030803347,
                57.77780899674528
            ],
            "location_radius": 0,
            "location_source": "ANNOTATION",
            "mentions": [
                "YouTube"
            ],
            "mentions_count": 1,
            "parent": "",
            "place_context": "ABOUT",
            "place_country": "Estonia",
            "place_country_center": [
                -14.095140436708519,
                29.78819461529524
            ],
            "place_country_code": "EE",
            "place_id": "",
            "place_name": "Valga",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "GGAllinTV",
            "source_type": "TWITTER",
            "text": "I liked a @YouTube video http://youtu.be/Gpnk2W7U0JU?a j walk scary ocean freestyle prod cat soup qy SopFb eM",
            "text_length": 109,
            "timestamp": "2017-05-27T02:49:23.545Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:23.545Z",
                "appearance_latest": "2017-05-27T02:49:23.545Z",
                "name": "GG Allin",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/862522608847388672/G9eki57T_bigger.jpg",
                "screen_name": "GGAllinTV",
                "user_id": "862521974492454912"
            },
            "videos": [
                "http://youtu.be/Gpnk2W7U0JU?a"
            ],
            "videos_count": 1,
            "without_l_len": 79,
            "without_lu_len": 70,
            "without_luh_len": 70
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "classifier_emotion": "sadness",
            "classifier_emotion_probability": 1.1593398463460858e-09,
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "youtu.be"
            ],
            "hosts_count": 1,
            "id_str": "868297956000243712",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/ZueirosdoAlem/status/868297956000243712",
            "links": [
                "http://youtu.be/Dek452jjPnQ?aA"
            ],
            "links_count": 1,
            "location_mark": [
                -75.70000917819658,
                45.40811922722547
            ],
            "location_point": [
                -75.69812017292628,
                45.41117084464577
            ],
            "location_radius": 0,
            "location_source": "ANNOTATION",
            "mentions": [
                "YouTube"
            ],
            "mentions_count": 1,
            "parent": "",
            "place_context": "ABOUT",
            "place_country": "Canada",
            "place_country_center": [
                -69.71663669669844,
                35.23458095511168
            ],
            "place_country_code": "CA",
            "place_id": "",
            "place_name": "Ottawa",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "ZueirosdoAlem",
            "source_type": "TWITTER",
            "text": "Adicionei um vídeo a uma playlist @YouTube http://youtu.be/Dek452jjPnQ?aA Idade da Terra e os Rádio-Halos de Polônio",
            "text_length": 116,
            "timestamp": "2017-05-27T02:49:23.546Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:23.546Z",
                "appearance_latest": "2017-05-27T02:49:23.546Z",
                "name": "ZUEIROS DO ALEM",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/782452740622520320/rnhR1FFJ_bigger.jpg",
                "screen_name": "ZueirosdoAlem",
                "user_id": "782440858075459584"
            },
            "videos": [
                "http://youtu.be/Dek452jjPnQ?aA"
            ],
            "videos_count": 1,
            "without_l_len": 85,
            "without_lu_len": 76,
            "without_luh_len": 76
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "classifier_language": "english",
            "classifier_language_probability": 1.1353589390931647e-15,
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "youtu.be",
                "pic.twitter.com"
            ],
            "hosts_count": 2,
            "id_str": "868297955954110465",
            "images": [
                "https://pbs.twimg.com/media/DAzQYorXsAIrDcp.jpg",
                "https://pic.twitter.com/W7CLjoJODX"
            ],
            "images_count": 2,
            "link": "https://twitter.com/AnymoreVN/status/868297955954110465",
            "links": [
                "https://youtu.be/28gJNRopdcE",
                "https://pic.twitter.com/W7CLjoJODX"
            ],
            "links_count": 2,
            "location_mark": [
                -38.98039287915756,
                -12.241403673832021
            ],
            "location_point": [
                -38.96667099509699,
                -12.266670461868273
            ],
            "location_radius": 0,
            "location_source": "ANNOTATION",
            "mentions": [],
            "mentions_count": 0,
            "parent": "",
            "place_context": "ABOUT",
            "place_country": "Brazil",
            "place_country_center": [
                -36.44791416192796,
                18.255414384536166
            ],
            "place_country_code": "BR",
            "place_id": "",
            "place_name": "Feira de Santana",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "AnymoreVN",
            "source_type": "TWITTER",
            "text": "Tem gente fazendo live de Sexta Feira 13 Conferindo o game: Friday The 13th The Game https://youtu.be/28gJNRopdcE https://pic.twitter.com/W7CLjoJODX",
            "text_length": 148,
            "timestamp": "2017-05-27T02:49:23.547Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:23.547Z",
                "appearance_latest": "2017-05-27T02:49:23.547Z",
                "name": "ɑny",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/868263884037251072/uPSelzAP_bigger.jpg",
                "screen_name": "AnymoreVN",
                "user_id": "3514292662"
            },
            "videos": [
                "https://youtu.be/28gJNRopdcE"
            ],
            "videos_count": 1,
            "without_l_len": 84,
            "without_lu_len": 84,
            "without_luh_len": 84
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "classifier_language": "spanish",
            "classifier_language_probability": 2.099857780197981e-15,
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "www.youtube.com"
            ],
            "hosts_count": 1,
            "id_str": "868297955815686148",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/onelovesophia/status/868297955815686148",
            "links": [
                "https://www.youtube.com/watch?v=Vhwpu8WWaqg&feature=share"
            ],
            "links_count": 1,
            "location_mark": [
                -118.24509779160243,
                34.053622196108115
            ],
            "location_point": [
                -118.24368319392377,
                34.05223074092166
            ],
            "location_radius": 0,
            "location_source": "ANNOTATION",
            "mentions": [],
            "mentions_count": 0,
            "parent": "",
            "place_context": "ABOUT",
            "place_country": "United States",
            "place_country_center": [
                -83.27110293007973,
                35.6452903550329
            ],
            "place_country_code": "US",
            "place_id": "",
            "place_name": "Los Angeles",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "onelovesophia",
            "source_type": "TWITTER",
            "text": "Sophia Abrahão | VLOG VIAGEM PARA LAS VEGAS E LOS ANGELES https://www.youtube.com/watch?v=Vhwpu8WWaqg&feature=share",
            "text_length": 115,
            "timestamp": "2017-05-27T02:49:23.563Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:23.563Z",
                "appearance_latest": "2017-05-27T02:49:23.563Z",
                "name": "Bia",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/862489169762938881/QXCadsN__bigger.jpg",
                "screen_name": "onelovesophia",
                "user_id": "702272623"
            },
            "videos": [
                "https://www.youtube.com/watch?v=Vhwpu8WWaqg&feature=share"
            ],
            "videos_count": 1,
            "without_l_len": 57,
            "without_lu_len": 57,
            "without_luh_len": 57
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "classifier_emotion": "sadness",
            "classifier_emotion_probability": 7.994097828145641e-09,
            "classifier_profanity": "swear",
            "classifier_profanity_probability": 3.6358862587348995e-09,
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "youtu.be"
            ],
            "hosts_count": 1,
            "id_str": "868297955765350401",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/OfficialMT_XD/status/868297955765350401",
            "links": [
                "http://youtu.be/D5D3QjSfaRQ?a"
            ],
            "links_count": 1,
            "mentions": [
                "YouTube"
            ],
            "mentions_count": 1,
            "parent": "",
            "place_context": "ABOUT",
            "place_id": "",
            "place_name": "",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "OfficialMT_XD",
            "source_type": "TWITTER",
            "text": "I liked a @YouTube video http://youtu.be/D5D3QjSfaRQ?a -Minecraft Live- Hypixel Games .:Mic On:.",
            "text_length": 96,
            "timestamp": "2017-05-27T02:49:23.566Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:23.566Z",
                "appearance_latest": "2017-05-27T02:49:23.566Z",
                "name": "Dakota Fravell",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/591573910144978944/LaiQGH-W_bigger.png",
                "screen_name": "OfficialMT_XD",
                "user_id": "3008957721"
            },
            "videos": [
                "http://youtu.be/D5D3QjSfaRQ?a"
            ],
            "videos_count": 1,
            "without_l_len": 66,
            "without_lu_len": 57,
            "without_luh_len": 57
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [],
            "hashtags_count": 0,
            "hosts": [
                "pic.twitter.com"
            ],
            "hosts_count": 1,
            "id_str": "868297955761020932",
            "images": [
                "https://pbs.twimg.com/ext_tw_video_thumb/868297845580865537/pu/img/zsTKkX6S7XsU5GM5.jpg",
                "https://pic.twitter.com/2i1eiCKur8"
            ],
            "images_count": 2,
            "link": "https://twitter.com/kiiiiirin_/status/868297955761020932",
            "links": [
                "https://pic.twitter.com/2i1eiCKur8"
            ],
            "links_count": 1,
            "mentions": [],
            "mentions_count": 0,
            "parent": "",
            "place_context": "ABOUT",
            "place_id": "",
            "place_name": "",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "kiiiiirin_",
            "source_type": "TWITTER",
            "text": "ぼちぼち笑 https://pic.twitter.com/2i1eiCKur8",
            "text_length": 40,
            "timestamp": "2017-05-27T02:49:29.741Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:29.741Z",
                "appearance_latest": "2017-05-27T02:49:29.741Z",
                "name": "しょうこ",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/865114470955098112/Ofjhprer_bigger.jpg",
                "screen_name": "kiiiiirin_",
                "user_id": "1487598457"
            },
            "videos": [
                "https://video.twimg.com/ext_tw_video/868297845580865537/pu/vid/0/3000/180x320/zKz501BrGW_2ijFx.ts",
                "https://video.twimg.com/ext_tw_video/868297845580865537/pu/vid/3000/6000/180x320/u3NrP59qz9Muk5k9.ts",
                "https://video.twimg.com/ext_tw_video/868297845580865537/pu/vid/6000/7600/180x320/vfP-nITVX7DzuZiB.ts",
                "https://video.twimg.com/ext_tw_video/868297845580865537/pu/vid/0/3000/360x640/6HwTT1t_o6FEf_uW.ts",
                "https://video.twimg.com/ext_tw_video/868297845580865537/pu/vid/3000/6000/360x640/NJCUFrUYAI8a6Ldc.ts",
                "https://video.twimg.com/ext_tw_video/868297845580865537/pu/vid/6000/7600/360x640/z6Z8zY6pmchCflAk.ts",
                "https://video.twimg.com/ext_tw_video/868297845580865537/pu/vid/0/3000/720x1280/U4z8yEtH81xuthm8.ts",
                "https://video.twimg.com/ext_tw_video/868297845580865537/pu/vid/3000/6000/720x1280/MJmuvCJ0Tfcrl2Th.ts",
                "https://video.twimg.com/ext_tw_video/868297845580865537/pu/vid/6000/7600/720x1280/xOIujzLq5z5-k7K7.ts"
            ],
            "videos_count": 9,
            "without_l_len": 5,
            "without_lu_len": 5,
            "without_luh_len": 5
        },
        {
            "audio": [],
            "audio_count": 0,
            "canonical_id": "",
            "created_at": "2017-05-27T02:48:59.000Z",
            "favourites_count": 0,
            "hashtags": [
                "1_رمضان"
            ],
            "hashtags_count": 1,
            "hosts": [
                "www.youtube.com"
            ],
            "hosts_count": 1,
            "id_str": "868297955727597569",
            "images": [],
            "images_count": 0,
            "link": "https://twitter.com/treen789/status/868297955727597569",
            "links": [
                "https://www.youtube.com/watch?v=vpHZ8_Y2EpA"
            ],
            "links_count": 1,
            "mentions": [],
            "mentions_count": 0,
            "parent": "",
            "place_context": "ABOUT",
            "place_id": "",
            "place_name": "",
            "provider_type": "SCRAPED",
            "retweet_count": 0,
            "screen_name": "treen789",
            "source_type": "TWITTER",
            "text": "هل ستكون غنيأ أم فقيراً في المستقبل إكتشف ذلك الآن ! https://www.youtube.com/watch?v=vpHZ8_Y2EpA v #1_رمضان",
            "text_length": 107,
            "timestamp": "2017-05-27T02:49:29.743Z",
            "unshorten": {},
            "user": {
                "appearance_first": "2017-05-27T02:49:29.743Z",
                "appearance_latest": "2017-05-27T02:49:29.743Z",
                "name": "_نوف _",
                "profile_image_url_https": "https://pbs.twimg.com/profile_images/868157411999264772/ZeWhRMH9_bigger.jpg",
                "screen_name": "treen789",
                "user_id": "704899885661884416"
            },
            "videos": [
                "https://www.youtube.com/watch?v=vpHZ8_Y2EpA"
            ],
            "videos_count": 1,
            "without_l_len": 63,
            "without_lu_len": 63,
            "without_luh_len": 54
        }
    ]
}

I have:

  • There is a corresponding issue for this pull request.
  • Mentioned the issue number in the pull request commit message, in the form Fixes #<number>.
  • There is strictly only one commit per issue.

For the reviewers

I have:

  • This pull request has been reviewed by an authorized contributor.
  • The reviewer is assigned to the pull request.

@singhpratyush singhpratyush changed the title from Extract video URLs from IFrame to Extract video URLs from IFrame on Twitter search page May 27, 2017

@djmgit

djmgit approved these changes May 27, 2017

Tested it. Results are showing up. Looks good to me.

@vibhcool

Member

vibhcool commented May 27, 2017

@singhpratyush, in the result, one video is output as a number of video frames. Before output, please join them into one video file.

@singhpratyush

Member

singhpratyush commented May 28, 2017

@vibhcool: That is not as simple as it sounds. The links are of the following format -
https://video.twimg.com/ext_tw_video/<some_id>/pu/vid/<start_time>/<end_time>/<height>x<width>/<some_string>.ts

So, the videos are broken down into smaller .ts files and served as requested. If we wish to provide a single file for a complete video, we will face two major issues -
1. Merging the files: To merge these files, we will require external tools like ffmpeg. The task of merging these files is slow and requires significant computing power.
2. Serving the merged file: Again, it will be too much load to serve video files.

If we serve .ts or the intermediate .m3u8 files, we can find many libraries that can play this format on different platforms (worth a mention - dash.js).
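
To illustrate the URL layout described above, here is a minimal Python sketch that groups the segment links by resolution and orders them by start time. The regular expression and the name `group_segments` are assumptions made for illustration, based only on the sample links in this thread; this is not the code in the patch, and the two dimension fields are kept as one opaque resolution label.

```python
import re
from collections import defaultdict

# Pattern inferred from the sample links in this thread (an assumption,
# not an official description of Twitter's URL scheme).
SEGMENT_RE = re.compile(
    r"https://video\.twimg\.com/ext_tw_video/(?P<video_id>\d+)/pu/vid/"
    r"(?P<start>\d+)/(?P<end>\d+)/(?P<resolution>\d+x\d+)/[\w-]+\.ts$"
)


def group_segments(urls):
    """Group .ts segment URLs by resolution, ordered by start time."""
    groups = defaultdict(list)
    for url in urls:
        match = SEGMENT_RE.match(url)
        if match is None:
            continue  # not a segmented video link (e.g. a YouTube URL)
        start = int(match.group("start"))
        groups[match.group("resolution")].append((start, url))
    # Sort each group by start time and drop the sort key again.
    return {res: [u for _, u in sorted(items)] for res, items in groups.items()}
```

A client could then pick a single resolution and play or download only that group, instead of handling every link in the `videos` array.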

@vibhcool

Member

vibhcool commented May 28, 2017

@singhpratyush Getting multiple video frames doesn't seem useful, but twitterdownloader outputs videos in mp4 format (any video format would do).
For example, for https://twitter.com/kiiiiirin_/status/868297955761020932
the video is https://video.twimg.com/ext_tw_video/868297845580865537/pu/vid/360x640/3gcixk28H9KOKzUL.mp4

@singhpratyush

Member

singhpratyush commented May 29, 2017

I couldn't find a way to get the direct mp4 link. I have looked through all the links that come with the Tweet, but none of them was relevant.

I strongly believe that the websites that provide such links use the API to do it, instead of scraping. In any case, I have seen people mention using mobile user agents for this purpose. I'll give it a try and post an update here.

@SKrPl

Member

SKrPl commented May 29, 2017

Looks good to me, the output is as expected. Good work 😄 @singhpratyush .

@mariobehling

Member

mariobehling commented May 29, 2017

@singhpratyush Could you add these improvements to the Scraper JS library please as well?
https://github.com/fossasia/loklak_scraper_js

@singhpratyush

Member

singhpratyush commented May 30, 2017

@mariobehling: This patch still needs improvements. Will update there once this gets merged.

Fix #1171: Extract video URLs from IFrame
Videos are added as an IFrame for Twitter. To fetch the video URLs, we first fetch the IFrame page and then check for the video format. If it is mp4, we're done. If it is m3u8, we need to fetch the m3u8 link in order to get actual videos. Mostly, these videos are of .ts format.

Also add org.unbescape as gradle dependency to unescape string in iframe.
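
The mp4/m3u8 branching described in this commit message can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Java code of the patch: the helper names `classify` and `playlist_entries` are hypothetical, no network request is made, and the playlist text is assumed to have been fetched already.

```python
from urllib.parse import urljoin


def classify(url):
    """Decide how to handle a video URL based on its file extension."""
    path = url.split("?", 1)[0]  # ignore any query string
    if path.endswith(".mp4"):
        return "direct"    # the URL already points at the video
    if path.endswith(".m3u8"):
        return "playlist"  # the playlist must be fetched and parsed
    return "unknown"


def playlist_entries(playlist_text, base_url):
    """Return the absolute URLs of all entries in an m3u8 playlist."""
    entries = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and tags such as #EXTM3U or #EXTINF
        # Entries may be relative, so resolve them against the playlist URL.
        entries.append(urljoin(base_url, line))
    return entries
```

If an entry returned by `playlist_entries` is itself an m3u8 link (a playlist of playlists), the same function can be applied again to reach the .ts files.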

@mariobehling mariobehling merged commit a6bf175 into loklak:development May 30, 2017

2 checks passed

codacy/pr Good work! A positive pull request.
continuous-integration/travis-ci/pr The Travis CI build passed

mariobehling added a commit that referenced this pull request Jul 5, 2017

Update Master with Development branch (#1281)
* deploy button info for docker #1001

* Fixes #1045 : Replace the image logo in navigation bar with a text

* Fixes #1048: fix execution of method without query string

* fix for latest twitter html change

* add docker status badge

This is related to issue #1049

* update documentation path

Problem: The documentation has moved. The links in the README are outdated
Solution: insert the containing folder into the path of all links to docs

* moved dockerfile to root folder #1049

* changed travis build to new location

* updated Dockerfile path for compose

This is for issue #1049.
The pull request #1050 is a precondition for this to make sense.

Problem: the Dockerfile was moved to /
Solution: adapt the path

* get aggregations also with fresh requests from twitter with source=all

* Fixes #1060: Increase default Xmx value

* Fixed #1059 - Remove and Ignore .DS_Store

* Fixes #1067 - Tweet URL in README is broken

* corrected heading

"Where do I find the java?" ->"Where do I find the Java documentation?"

* Using the note directive of sphinx

See #1042

* README.md upd, useful links added

* Fixes #1033, loklak_server README.md upd, links updated with link syntax

* Move documentation site

The documentation site is now moved to https://github.com/loklak/dev.loklak.org

Closes #1014

* fix username emoji in tweet

* Fix unused imports in python files(codacy issue)

Related to #1070

* removed .DS_Store

* added .DS_Store to gitignore

* Fix use of Null in scala code

Related to #1070

* fixed scraper

* Edited Readme

* Add update trigger script for docs

Closes #1003

* Creating Volume for persistence while deploying via docker, fix #1051 (#1089)

* Update Dockerfile

* Update Dockerfile

* Update docker-compose.yml

* Update docker-compose.yml

* Update docker-compose.yml

* Update Dockerfile

* Update Dockerfile

* Update Dockerfile

* Update Dockerfile

* Update docker-compose.yml

* Update docker-compose.yml

* Update docker-compose.yml

* Update docker-compose.yml

* updated docker build badge

I changed the url so github requests a new image.
The build works.
https://hub.docker.com/r/mariobehling/loklak/builds/

* Docker: Consistent Volume Path

Problem: docker-compose volume path is not the same as the dockerfile volume path
Solution: Set the docker-compose volume path to the dockerfile volume path

You can view the correct path in the Dockerfile:
https://github.com/loklak/loklak_server/blob/7a1f0378dc40ec25eec6083e43558a62408d84e8/Dockerfile#L38
I checked in the container:
```
bash-4.3# ls /loklak_server/
bin              conf             gradlew          settings.gradle
build            data             html             src
build.gradle     gradle           installation     ssi
bash-4.3# ls /
bin            lib            proc           srv            var
dev            loklak_server  root           sys
etc            media          run            tmp
home           mnt            sbin           usr
```
the data directory exists and is filled within `/loklak_server`

* .travis.yml: Add keys for dev.loklak.org

Closes #1091

* fix initGet

* option to autodelete messages after one month from the main index

* disabling feature introduced with
27272ee
for issue #919

Storing the settings file caused the settings file to break. It blew up
to a huge file, like
$ ls -l customized_config.properties
-rw-r--r-- 1 loklak loklak 251650030 Apr 10 19:08 customized_config.properties

This is the main reason that loklak.org has been down since this feature was
introduced.

* Fixes #1099 : Changes the href link of the button download, install and extend

* fix #1056 - document how to start contributing (#1063)

* Added JS EventListener to resize dump iframe on load. Closes #1101

* Add Unit Tests to Loklak Server (#1098)

* Add unit tests for TwitterScraper.java

* Add data file to test JSONRandomAccessFileTest.java

* set up unit tests build in loklak Server

* fix changes requested and codacy issues

* fixes scrollbar event

* In the Twitter scraper, use a more readable version of assert. Also fix a bug with parsing a long in the YouTube scraper (Long.parseLong fails because spaces are not removed) and add a unit test for the YouTube scraper.

* fix bug with YouTube scraper and add unit test for scraper

* Fixes #1103: Changed the URLs to the correct ones (#1104)

* Fixes #1103: Changed the URLs to the correct ones

* Fixes #1108: Fixed the typos in documentation

* fix and modify the GithubProfileScraper.java

* fixes #961: add query in KaizenHarvester's queue to get older Tweets

If the current timeline's query already has an until statement, replace its date part with the oldest one. Also add a DateFormat object in KaizenHarvester to format a Date as a String of format yyyy-MM-dd.
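
As a rough illustration of the until handling described in this commit message, the following Python sketch replaces the date in an existing `until:` clause. The function name `with_until` is hypothetical, and appending a clause when none exists is an assumption; the actual change is implemented in Java in the harvester.

```python
import re
from datetime import date

# Matches an existing clause such as "until:2017-05-20".
UNTIL_RE = re.compile(r"\buntil:\d{4}-\d{2}-\d{2}\b")


def with_until(query, oldest):
    """Return the query restricted to Tweets older than `oldest`."""
    clause = "until:" + oldest.strftime("%Y-%m-%d")
    if UNTIL_RE.search(query):
        # Keep the query as it is and swap only the date part.
        return UNTIL_RE.sub(clause, query)
    # Assumption: a query without an until clause simply gets one appended.
    return query + " " + clause
```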

* fix eclipse classpath for storing classes (#1097)

* Fixes #1123: Adding Gemnasium Button & Fixing Docker build button

* Fix Codacy issue in Timeline.java. Related #1070

Link to codacy: https://www.codacy.com/app/sudheesh1995/loklak_server/file/6470204147/issues/source?bid=3495500&fileBranchId=3495500
Description: Fields should be declared at the top of the class

* Fix Codacy issue for some files in org.loklak.server.api. Related #1070

* ConsoleService.java
  - Fields should be declared at the top of the class
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6484902617/issues/source?bid=3495500&fileBranchId=3495500

* EventBriteCrawler.java
  - Make spacing consistent for conditionals

* GraphServlet.java
  - Reduce complexity of doGet method
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6484903642/issues/source?bid=3495500&fileBranchId=3495500

* Rename Dockerfile-learnings.md to docs/Dockerfile-learnings.md

* fix #1138: Correct spelling mistake in README.md (#1140)

Change "descripe" to "describe" in How to Contribute section.

* Fixes #1123: Adding Gemnasium Button & Fixing Docker build button in rst file (#1137)

* Related #1070: Fix Codacy issues for files in org.loklak.api.search (#1134)

* EventBriteCrawlerService.java
  - Use one line for each declaration
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733425/issues/source?bid=3495500&fileBranchId=3495500

* GenericScraper.java
  - Indentation fix
  - New line before EOF

* GithubProfileScraper.java
  - Remove trailing whitespaces

* MeetupsCrawlerService.java
  - Use one line for each declaration
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733676/issues/source?bid=3495500&fileBranchId=3495500

* SearchServlet.java
  - Indentation fix

* SuggestServlet.java
  - Position literals first in String comparisons
  - Fields should be declared at the top of the class
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733640/issues/source?bid=3495500&fileBranchId=3495500

* WeiboUserInfo.java
  - Switch statements should have a default label
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733550/issues/source?bid=3495500&fileBranchId=3495500

* Fixes #1139: Changed the URL (#1141)

* Fixes #1139: Changed the URL

* Fixes #1139: Changed the URL

* Fix "Strings must use doublequote. (quotes)"
Related to #1070

* Fix #1070: Strings must use doublequote. (quotes), no-use-before-define

* Fix #1070: Strings must use doublequote. (quotes), no-use-before-define

* Fixes #1070:Strings must use double quotes, no-use-before-define

* Related to #1070:Strings must use double quotes, no-use-before-define

* Related #1058: Add Kaizen harvester usage documentation (#1145)

* Fix #1070: Strings must use doublequote. (quotes), no-use-before-define (#1121)

* Fix "Strings must use doublequote. (quotes)"
Related to #1070

* Fix #1070: Strings must use doublequote. (quotes), no-use-before-define

* Fix #1070: Strings must use doublequote. (quotes), no-use-before-define

* fix #1130: Make retries and back off parameter for backend push configurable (#1131)

These variables can be set from config.properties by changing/defining caretaker.backendpush.retries and caretaker.backendpush.backoff respectively.
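
A minimal Python sketch of how such configurable retries might be used is shown below. Only the two property names come from the commit message; the function name `push_with_retries`, the default values, and the reading of the backoff as a fixed delay in milliseconds are assumptions for illustration.

```python
import time


def push_with_retries(push, config, sleep=time.sleep):
    """Call push() until it succeeds or the configured retries run out."""
    # The defaults 5 and 3000 are placeholders, not values from loklak.
    retries = int(config.get("caretaker.backendpush.retries", 5))
    backoff_ms = int(config.get("caretaker.backendpush.backoff", 3000))
    for attempt in range(retries):
        if push():
            return True
        if attempt < retries - 1:
            sleep(backoff_ms / 1000.0)  # wait before the next attempt
    return False
```

Passing `sleep` as a parameter keeps the sketch easy to test without real waiting.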

* Fixes part of #1132: Add unit test to check TwitterScraper output (#1133)

* convert markdown file to rst (#1142)

* Merged development fixed conflict.

* Improve code quality for org.loklak.geo.*

* Related #1070: Improve code quality for org.loklak.api.admin.* (#1149)

* Related to #1070: Improve code quality for org.loklak.Crawler.java

* fix related to #1152: code refactoring for logging (#1153)

* fix related to #1133: fix access specifiers (#1151)

* fixes #1161: Add GCloud Kubernetes deployment document for loklak (#1162)

* fixes #1146: Check for TwitterFactory before getting instance (#1147)

* Related #1070: Fix Codacy issues for org.loklak.api.amazon.* (#1163)

* fix #1143: Fix NumberException in YoutubeScraper (#1157)

* Installation and Start on a user specified port (#1159)

Solves issue: #925

* Fixes #1165: Fixed the QuoraProfileScraper and displaying profileImage

* Related to #1112:Add filter for images, videos (#1164)

* Related #1156: Make harvesting decision biased for Kaizen (#1158)

The probability is chosen as queries.size() / QUERIES_LIMIT, which is compared to a randomly chosen target probability, and the decision is taken accordingly. If there is no limit on the queue size, the probability to harvest is set to 0.5.
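
The biased decision can be sketched in Python as follows. This is an illustration only: the name `should_harvest`, the representation of "no limit" as `None` or a non-positive number, and the direction of the comparison (harvest when the random target falls below the probability) are assumptions, not the Java implementation.

```python
import random


def should_harvest(queue_size, queries_limit=None, rng=random.random):
    """Decide whether to harvest, biased by how full the query queue is."""
    if queries_limit is None or queries_limit <= 0:
        probability = 0.5  # no limit on the queue size
    else:
        probability = queue_size / queries_limit
    # Compare against a randomly chosen target in [0, 1).
    return rng() < probability
```

With this reading, a nearly full queue makes harvesting very likely, while an empty queue never triggers it.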

* Fixes #1167 GithubScraperService able to scrape user specific data (#1168)

Fixes issue #1167. The githubprofilescraper service now displays starred_url,
number of starred repos, followers_url, number of followers, following_url, and
number of people following for a particular user.

* fixes #1114 Improve URL shortening service

* Include all 30X HTTP response code while checking for redirect.
* Use POST requests as fallback for GET requests - There are many cases (mostly https?://fb.me/*) when GET requests give status 400: Bad Request, while POST request works fine. The patch attempts a POST request in such cases and fetches the result.
* Try to fetch URL from <meta/> tag in response body in case of non redirect status code.
* Check the validity of URL shortening only once, and not for each intermediate URL.
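
Two of the points above, treating every 30X code as a redirect and falling back to a `<meta/>` tag in the body, can be sketched in Python like this. The commit message does not say which meta tag is read, so the refresh tag used here is an assumption, as are the function names and the plain-dict headers; the POST fallback is left out because it needs a live request.

```python
import re

# Assumption: the target URL comes from a tag such as
# <meta http-equiv="refresh" content="0; url=https://example.org/">.
META_REFRESH_RE = re.compile(
    r"""<meta[^>]+http-equiv=["']?refresh["']?[^>]*content=["']?\s*\d+\s*;\s*url=([^"'>\s]+)""",
    re.IGNORECASE,
)


def is_redirect(status_code):
    """Treat every 30X status code as a redirect."""
    return 300 <= status_code < 310


def next_url(status_code, headers, body):
    """Return the URL to follow next, or None if the chain ends here."""
    if is_redirect(status_code):
        return headers.get("Location")
    # Non-redirect status: look for a target inside the response body.
    match = META_REFRESH_RE.search(body)
    return match.group(1) if match else None
```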

* Displays proper url to open loklak_server

Solves issue: #1172

Displays proper localhost url in which loklak_server is running after
the execution of bin/start.sh or bin/installation.sh with a "p" flag.

Earlier the localhost url only displayed port 9000 at the end in case of
bin/start.sh and concatenated the running port with 9000 in case of
bin/stop.sh. Ex:
http://localhost:9000 # bin/start.sh, actual port 8888
http://localhost:90008888 #bin/installation.sh, actual port 8888

* fixes #1177 - Added tests for WordpressCrawlerService.java

fixes issue #1177. Added tests for WordpressCrawlerService.java and
also removed the leading 'Author' from the author field in json
output.

* fix #1176: Fetch debug flag from config file

Change configurations for TwitterScraper and ClientConnection

* fixes #1184 - Instagram Profile Scraper is now working

fixes issue #1184. Instagram scraper is now returning data.

* fix #1179: Use java.net.URL to build relative URL in ClientConnection (#1183)

* fixes #1070: Add test for URL unshortening (#1173)

* fixes #1169 - Added test for Github profile scraper (#1185)

fixes issue #1169, Added tests for GithubProfileScraper service.

* Improve code quality for some files in org.loklak.api.cms and add checkstyle as gradle task (#1187)

* Related #1070: Improve code quality for some files in org.loklak.api.cms

Fixes are done using checkstyle with google_check.xml config and 4 space indentation level

* Add checkstyle check as gradle task

* Fixes #1191: NullPointerException in CareTaker.java (#1192)

* Auto-generate docs in dev.loklak.org repository (#1195)

* Fix #1171: Extract video URLs from IFrame (#1193)

Videos are added as an IFrame for Twitter. To fetch the video URLs, we first fetch the IFrame page and then check for the video format. If it is mp4, we're done. If it is m3u8, we need to fetch the m3u8 link in order to get actual videos. Mostly, these videos are of .ts format.

Also add org.unbescape as gradle dependency to unescape string in iframe.
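
The format check described above can be sketched roughly as follows. This is a minimal Python illustration, not the Java code from this PR: `resolve_video_urls`, `playlist_entries`, and the `fetch_text` callable are hypothetical names, and the sketch assumes that every non-comment line of an m3u8 playlist is a URL (absolute or relative to the playlist).

```python
from urllib.parse import urljoin, urlparse


def playlist_entries(playlist_text, base_url):
    """Return absolute URLs for every non-comment line of an m3u8 playlist."""
    entries = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # tags and comments start with '#'
        entries.append(urljoin(base_url, line))
    return entries


def resolve_video_urls(video_url, fetch_text):
    """Classify a video URL by extension and expand m3u8 playlists.

    fetch_text is a caller-supplied function (hypothetical) that downloads
    a URL and returns its body as text.
    """
    path = urlparse(video_url).path
    if path.endswith(".mp4"):
        return [video_url]  # mp4: nothing more to fetch
    if path.endswith(".m3u8"):
        return playlist_entries(fetch_text(video_url), video_url)
    return []  # unknown format: report nothing
```

Passing the downloader in as a parameter keeps the sketch testable without network access; a playlist entry that is itself an m3u8 file would need one more round of the same expansion.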

* Fix #1201: Break down KaizenHarvester into simpler pieces (#1203)

Introduce KaizenQuery class to support different methods to store queries that Kaizen needs to process

* Fix #1208: Add .editorconfig (#1209)

* Fixes #1204 Add subtree if not already added (#1207)

* Fix #1205: Extract complete video URLs for Tweets (#1206)

This implementation mimics the video playback flow of Twitter's mobile React app.
    1. Extract the URL of the script that holds BEARER_TOKEN.
    2. Extract the guest session token.
    3. Extract BEARER_TOKEN from the script at the URL found in step 1.
    4. Make the Twitter API call with these parameters.
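
The first three steps can be sketched as below. This is only an illustration in Python, not the code from the commit: the commit message does not say how the values are located, so the regular expressions are supplied by the caller and the ones used in the usage note are made up. `collect_request_parameters`, `first_match`, and `fetch_text` are hypothetical names.

```python
import re


def first_match(pattern, text):
    """Return the first capture group of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def collect_request_parameters(page_html, fetch_text, patterns):
    """Gather the values needed for the final API call (step 4).

    patterns maps 'script_url', 'guest_token' and 'bearer_token' to regular
    expressions with one capture group each; all three are assumptions.
    """
    script_url = first_match(patterns["script_url"], page_html)    # step 1
    guest_token = first_match(patterns["guest_token"], page_html)  # step 2
    script_text = fetch_text(script_url) if script_url else ""
    bearer_token = first_match(patterns["bearer_token"], script_text)  # step 3
    return {
        "script_url": script_url,
        "guest_token": guest_token,
        "bearer_token": bearer_token,
    }
```

A caller would pass the tweet page's HTML, a download function, and patterns such as `r'BEARER_TOKEN:"([^"]+)"'` (a guess at the script's layout), then use the returned values to build the request in step 4.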

* fixes #1196 - Enhanced Quora profile scraper #1199 (#1200)

Fixes issue #1196. The scraper now provides more information, such as the
user's university, the location where the user works, topics the user knows,
number of followers, number of questions, number of edits, number of blogs, etc.

* Fix #1188: Use unbescape to unescape HTML in html2utf8 (#1194)

Also improve whitespace cleaning in the method. Move old implementation to html2utf8Custom.

* Fixes #1097: Restore access specifiers in TwitterScraper.java (#1198)

* Fix indentation (#1211)

* Fix #1212: fix checkstyle errors(except missing javadoc) (#1218)

* Fixes #1215 fix syntax error in the script (#1217)

* Fix #1213: Include videos for testing TwitterScraper (#1221)

* Fix 1216: Revert "Installation and Start on a user specified port (#1159)" (#1227)

This reverts commit 1e0bcd5.

Conflicts (resolved):
	bin/installation.sh
	bin/start.sh

* Fixes #1202: Modify loggers in Loklak Server for testing (#1222)

* Fixes #1219: Add UTC time in TimeAndDateService (#1220)

* Fixes #1112: Add image, video filter constraints for cache (#1190)

* Fixes #1236: Update Docs for get parameter (#1237)

* Fixes #1226 Build error currently showing (#1228)

* Fixes 1215 Fix relative link

* Update git to work with subtree

* Adding echo statements

* Fix #1239: Correct flag values in config.properties

* Fix #1238: Add PriorityQueue harvesting strategy (#1240)

Also add score related to each Tweet based on retweet and favourite count.
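
A priority-queue strategy of this kind might look like the following Python sketch. It is not the Kaizen implementation: the commit message only says the score is based on retweet and favourite count, so the plain sum in `tweet_score` is an assumption, and `ScoredQueue` is a hypothetical name.

```python
import heapq
import itertools


def tweet_score(retweets, favourites):
    """Assumed scoring rule: the sum of both counts."""
    return retweets + favourites


class ScoredQueue:
    """Queue that always hands out the highest-scored item first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps insertion order

    def push(self, item, score):
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, next(self._order), item))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Queries derived from popular Tweets would then be pushed with `queue.push(query, tweet_score(retweets, favourites))` and harvested with `queue.pop()`.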

* Fix #1251: Correct test case for RedirectUnshortener (#1253)

http://t.co/E3w7s2qdBT now points to http://www.mostviralfeed.com/what-lady-gaga-actually-looks-like instead of http://mostviralfeed.com/what-lady-gaga-actually-looks-like

* Fix #1247: Add function to collect stats about all classes for a classifier (#1248)

* Fix #1256: Add classifier.json endpoint to serve aggregated data (#1257)

* refactoring to have the same naming as in susi_server

* Fixes #1261: RedirectUnshortener link fix (#1262)

* Fixes #1229, #1235, Related #1230: Setup of testable version (#1250)

1) setup post and basescraper

2) Setup quoraprofilescraper with basescraper and post

* Fix #1259: Add function for time sensitive aggregation (#1260)

* Fix #1271: Correct redirect link in test (#1272)

* Fix #1266: Allow time based aggregation in /api/classifier.json (#1267)

* Fix #1278: Correct typo in kaizen.md (#1279)

* enhanced elasticsearch mapping

* eclipse classpath to use same as gradle

* removed unused imports

* Fix #1268: Add function for aggregation based on country codes (#1270)

The following operations are now possible:
* All time aggregation for all countries
* Time sensitive aggregation for all countries
* All previous aggregations for selected countries
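
The three operations listed above can be pictured with a small Python sketch. The record layout (`country`, `label`, `time`) and the function name `aggregate_by_country` are assumptions made for illustration; the real aggregation runs against the server's index and is not shown in this commit message.

```python
from collections import Counter, defaultdict
from datetime import datetime


def aggregate_by_country(records, since=None, until=None, countries=None):
    """Count classifier labels per country code.

    since/until bound the time window (until is exclusive); countries, if
    given, restricts the result to the selected country codes.
    """
    result = defaultdict(Counter)
    for record in records:
        if countries is not None and record["country"] not in countries:
            continue
        if since is not None and record["time"] < since:
            continue
        if until is not None and record["time"] >= until:
            continue
        result[record["country"]][record["label"]] += 1
    return {country: dict(counts) for country, counts in result.items()}
```

Calling it with no optional arguments gives the all-time aggregation for all countries, adding `since`/`until` makes it time sensitive, and adding `countries` narrows either form to a selection.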

* Fix #1273: Add Jacoco to provide coverage report in XML format (#1274)

* Fixes 1284: Improve test cases for URL unshortener (#1285)

* Setup post and basescraper with QuoraProfileScraper (#1249)

* Setup of testable version

setup post and basescraper

* Related #1230, 1231, 1244: integrate Timeline2 with quorascraper

* Configure ssh agent before push

vibhcool added a commit to vibhcool/loklak_server that referenced this pull request Jul 6, 2017

Update Master with Development branch (loklak#1281)
* deploy button info for docker loklak#1001

* Fixes loklak#1045 : Replace the image logo in navigation bar with a text

Fixes loklak#1045 : Replace the image logo in navigation bar with a text

* Fixes loklak#1048: fix execution of method without query string

* fix for latest twitter html change

* add docker status badge

This is related to issue loklak#1049

* update documentation path

Problem: The documentation has moved. The links in the README are outdated
Solution: insert the containing folder into the path of all links to docs

* moved dockerfile to root folder loklak#1049

* changed travis build to new location

* updated Dockerfile path for compose

This is for issue loklak#1049.
The pull request loklak#1050 is a precondition for this to make sense.

Problem: the Dockerfile was moved to /
Solution: adapt the path

* get aggregations also with fresh requests from twitter with source=all

* Fixes loklak#1060: Increase default Xmx value

* Fixed loklak#1059 - Remove and Ignore .DS_Store

* Fixes loklak#1067 - Tweet URL in README is broken

* corrected heading

"Where do I find the java?" ->"Where do I find the Java documentation?"

* Using the note directive of sphinx

See loklak#1042

* README.md upd, useful links added

* Fixes loklak#1033, loklak_server README.md upd, links updated with link syntax

* Move documentation site

The documentation site is now moved to https://github.com/loklak/dev.loklak.org

Closes loklak#1014

* fix username emoji in tweet

* Fix unused imports in python files(codacy issue)

Related to loklak#1070

* removed .DS_Store

* added .DS_Store to gitignore

* Fix use of Null in scala code

Related to loklak#1070

* fixed scraper

* Edited Readme

* Add update trigger script for docs

Closes loklak#1003

* Creating Volume for persistence while deploying via docker, fix loklak#1051 (loklak#1089)

* Update Dockerfile

* Update Dockerfile

* Update docker-compose.yml

* Update docker-compose.yml

* Update docker-compose.yml

* Update Dockerfile

* Update Dockerfile

* Update Dockerfile

* Update Dockerfile

* Update docker-compose.yml

* Update docker-compose.yml

* Update docker-compose.yml

* Update docker-compose.yml

* updated docker build badge

I changed the url so github requests a new image.
The build works.
https://hub.docker.com/r/mariobehling/loklak/builds/

* Docker: Consistent Volume Path

Problem: docker-compose volume path is not the same as the dockerfile volume path
Solution: Set the docker-compose volume path to the dockerfile volume path

You can view the correct path in the Dockerfile:
https://github.com/loklak/loklak_server/blob/7a1f0378dc40ec25eec6083e43558a62408d84e8/Dockerfile#L38
I checked in the container:
```
bash-4.3# ls /loklak_server/
bin              conf             gradlew          settings.gradle
build            data             html             src
build.gradle     gradle           installation     ssi
bash-4.3# ls /
bin            lib            proc           srv            var
dev            loklak_server  root           sys
etc            media          run            tmp
home           mnt            sbin           usr
```
the data directory exists and is filled within `/loklak_server`

* .travis.yml: Add keys for dev.loklak.org

Closes loklak#1091

* fix initGet

* option to autodelete messages after one month from the main index

* disabling feature introduced with
27272ee
for issue loklak#919

Storing the settings file left it broken. It blew up to a huge file, like
$ ls -l customized_config.properties
-rw-r--r-- 1 loklak loklak 251650030 Apr 10 19:08 customized_config.properties

This is the main reason loklak.org has been down since this feature was
introduced.

* Fixes loklak#1099 : Changes the href link of the button download, install and extend

* fix loklak#1056 - document how to start contributing (loklak#1063)

* Added JS EventListener to resize dump iframe on load. Closes loklak#1101

* Add Unit Tests to Loklak Server (loklak#1098)

* Add unit tests for TwitterScraper.java

* Add data file to test JSONRandomAccessFileTest.java

* set up unit tests build in loklak Server

* fix changes requested and codacy issues

* fixes scrollbar event

* In the Twitter scraper, use a more readable version of assert; also fix a bug with parsing a long in the YouTube scraper (the Long.parse method fails because spaces are not removed), and add a unit test for the YouTube scraper.

* fix bug with YouTube scraper and add unit test for scraper

* Fixes loklak#1103: Changed the URLs to the correct ones (loklak#1104)

* Fixes loklak#1103: Changed the URLs to the correct ones

* Fixes loklak#1108: Fixed the typos in documentation

* fix and modify the GithubProfileScraper.java

* fixes loklak#961: add query in KaizenHarvester's queue to get older Tweets

If the current timeline's query already has an until statement, replace its date part with the oldest one. Also add a DateFormat object in KaizenHarvester to format a Date as a String of format yyyy-MM-dd.
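
The query rewrite can be sketched in a few lines of Python. This is an illustration, not the KaizenHarvester code: `older_query` is a hypothetical name, and appending an `until:` clause when the query has none is an assumption about the other branch.

```python
import re
from datetime import date

# Matches an existing clause such as "until:2017-05-01".
UNTIL_CLAUSE = re.compile(r"\buntil:\d{4}-\d{2}-\d{2}")


def older_query(query, oldest):
    """Return a query that asks for Tweets older than the given date."""
    clause = "until:" + oldest.strftime("%Y-%m-%d")  # yyyy-MM-dd
    if UNTIL_CLAUSE.search(query):
        # Replace only the date part of the existing until statement.
        return UNTIL_CLAUSE.sub(clause, query, count=1)
    return query + " " + clause  # assumed behaviour when no clause exists
```

For example, `older_query("fossasia until:2017-05-01", date(2017, 4, 20))` yields a query ending in `until:2017-04-20`.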

* fix eclipse classpath for storing classes (loklak#1097)

* Fixes loklak#1123: Adding Gemnasium Button & Fixing Docker build button

* Fix Codacy issue in Timeline.java. Related loklak#1070

Link to codacy: https://www.codacy.com/app/sudheesh1995/loklak_server/file/6470204147/issues/source?bid=3495500&fileBranchId=3495500
Description: Fields should be declared at the top of the class

* Fix Codacy issue for some files in org.loklak.server.api. Related loklak#1070

* ConsoleService.java
  - Fields should be declared at the top of the class
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6484902617/issues/source?bid=3495500&fileBranchId=3495500

* EventBriteCrawler.java
  - Make spacing consistent for conditionals

* GraphServlet.java
  - Reduce complexity of doGet method
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6484903642/issues/source?bid=3495500&fileBranchId=3495500

* Rename Dockerfile-learnings.md to docs/Dockerfile-learnings.md

* fix loklak#1138: Correct spelling mistake in README.md (loklak#1140)

Change "descripe" to "describe" in How to Contribute section.

* Fixes loklak#1123: Adding Gemnasium Button & Fixing Docker build button in rst file (loklak#1137)

* Related loklak#1070: Fix Codacy issues for files in org.loklak.api.search (loklak#1134)

* EventBriteCrawlerService.java
  - Use one line for each declaration
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733425/issues/source?bid=3495500&fileBranchId=3495500

* GenericScraper.java
  - Indentation fix
  - New line before EOF

* GithubProfileScraper.java
  - Remove trailing whitespaces

* MeetupsCrawlerService.java
  - Use one line for each declaration
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733676/issues/source?bid=3495500&fileBranchId=3495500

* SearchServlet.java
  - Indentation fix

* SuggestServlet.java
  - Position literals first in String comparisons
  - Fields should be declared at the top of the class
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733640/issues/source?bid=3495500&fileBranchId=3495500

* WeiboUserInfo.java
  - Switch statements should have a default label
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733550/issues/source?bid=3495500&fileBranchId=3495500

* Fixes loklak#1139: Changed the URL (loklak#1141)

* Fixes loklak#1139: Changed the URL

* Fixes loklak#1139: Changed the URL

* Fix "Strings must use doublequote. (quotes)"
Related to loklak#1070

* Fix loklak#1070: Strings must use doublequote. (quotes), no-use-before-define

* Fix loklak#1070: Strings must use doublequote. (quotes), no-use-before-define

* Fixes loklak#1070:Strings must use double quotes, no-use-before-define

* Related to loklak#1070:Strings must use double quotes, no-use-before-define

* Related loklak#1058: Add Kaizen harvester usage documentation (loklak#1145)

* Fix loklak#1070: Strings must use doublequote. (quotes), no-use-before-define (loklak#1121)

* Fix "Strings must use doublequote. (quotes)"
Related to loklak#1070

* Fix loklak#1070: Strings must use doublequote. (quotes), no-use-before-define

* Fix loklak#1070: Strings must use doublequote. (quotes), no-use-before-define

* fix loklak#1130: Make retries and back off parameter for backend push configurable (loklak#1131)

These variables can be set from config.properties by changing/defining caretaker.backendpush.retries and caretaker.backendpush.backoff respectively.
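
How the two settings might drive a retry loop is sketched below in Python. Only the two property names come from the commit message; the default values, the millisecond unit, the linearly growing wait, and the name `push_with_retries` are assumptions.

```python
import time


def push_with_retries(push, config, sleep=time.sleep):
    """Call push() until it succeeds or the configured retries run out.

    config is a dict of properties; push is a callable returning True on
    success. Defaults and the millisecond unit are assumed, not documented.
    """
    retries = int(config.get("caretaker.backendpush.retries", 5))
    backoff_ms = int(config.get("caretaker.backendpush.backoff", 3000))
    for attempt in range(1, retries + 1):
        if push():
            return True
        if attempt < retries:
            # Wait longer after each failed attempt (linear back-off).
            sleep(attempt * backoff_ms / 1000)
    return False
```

Taking `sleep` as a parameter lets a test record the waits without actually pausing.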

* Fixes part of loklak#1132: Add unit test to check TwitterScraper output (loklak#1133)

* convert markdown file to rst (loklak#1142)

* Merged development fixed conflict.

* Improve code quality for org.loklak.geo.*

* Related loklak#1070: Improve code quality for org.loklak.api.admin.* (loklak#1149)

* Related to loklak#1070: Improve code quality for org.loklak.Crawler.java

* fix related to loklak#1152: code refactoring for logging (loklak#1153)

* fix related to loklak#1133: fix access specifiers (loklak#1151)

* fixes loklak#1161: Add GCloud Kubernetes deployment document for loklak (loklak#1162)

* fixes loklak#1146: Check for TwitterFactory before getting instance (loklak#1147)

* Related loklak#1070: Fix Codacy issues for org.loklak.api.amazon.* (loklak#1163)

* fix loklak#1143: Fix NumberException in YoutubeScraper (loklak#1157)

* Installation and Start on a user specified port (loklak#1159)

Solves issue: loklak#925

* Fixes loklak#1165: Fixed the QuoraProfileScraper and displaying profileImage

* Related to loklak#1112:Add filter for images, videos (loklak#1164)

* Related loklak#1156: Make harvesting decision biased for Kaizen (loklak#1158)

A probability is computed as queuries.size() / QUERIES_LIMIT and compared with a randomly chosen target probability; the decision is made accordingly. If there is no limit on the queue size, the probability to harvest is set to 0.5.
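
The biased decision can be sketched as follows in Python. The names `should_harvest` and `rng` are hypothetical, treating a missing or non-positive limit as "no limit" is an assumption, and so is the direction of the comparison (harvest when the random draw falls below the computed probability).

```python
import random


def should_harvest(queue_size, queries_limit, rng=random.random):
    """Decide whether to harvest, biased by how full the query queue is."""
    if queries_limit is None or queries_limit <= 0:
        probability = 0.5  # no limit on the queue size
    else:
        probability = queue_size / queries_limit
    # rng() returns a value in [0, 1); a fuller queue makes True more likely.
    return rng() < probability
```

With this rule an empty queue never triggers harvesting and a full queue always does, with a smooth transition in between.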

* Fixes loklak#1167 GithubScraperService able to scrape user specific data (loklak#1168)

Fixes issue loklak#1167. The githubprofilescraper service now displays starred_url,
number of starred repos, followers_url, number of followers, following_url,
and number of people following for a particular user.

* fixes loklak#1114 Improve URL shortening service

* Include all 30X HTTP response code while checking for redirect.
* Use POST requests as a fallback for GET requests. There are many cases (mostly https?://fb.me/*) where a GET request returns status 400: Bad Request while a POST request works fine. The patch attempts a POST request in such cases and fetches the result.
* Try to fetch URL from <meta/> tag in response body in case of non redirect status code.
* Check the validity of URL shortening only once, and not for each intermediate URL.
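
Taken together, the first three points suggest a resolution loop like the Python sketch below. It is not the server's Java code: `unshorten`, `next_location`, and the `request` callable are hypothetical names, the cap of 10 hops is arbitrary, and reading the target from a meta refresh tag is an assumption, since the commit message only mentions a <meta/> tag.

```python
import re

# Assumed form: <meta http-equiv="refresh" content="0; url=...">
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*url=([^"\'>\s]+)',
    re.IGNORECASE,
)


def next_location(status, headers, body):
    """Return the URL a response points to, or None if it is final."""
    if 300 <= status < 400 and "Location" in headers:
        return headers["Location"]  # any 30X code counts as a redirect
    match = META_REFRESH.search(body)
    return match.group(1) if match else None


def unshorten(url, request, max_hops=10):
    """Follow redirects; request(method, url) returns (status, headers, body)."""
    for _ in range(max_hops):
        status, headers, body = request("GET", url)
        if status == 400:
            # Fall back to POST when GET is rejected with 400: Bad Request.
            status, headers, body = request("POST", url)
        target = next_location(status, headers, body)
        if target is None:
            return url
        url = target
    return url
```

The `request` parameter stands in for whatever HTTP client is used, which keeps the loop itself easy to test with canned responses.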

* Displays proper url to open loklak_server

Solves issue: loklak#1172

Displays proper localhost url in which loklak_server is running after
the execution of bin/start.sh or bin/installation.sh with a "p" flag.

Earlier, the localhost URL always ended in port 9000 in case of
bin/start.sh, and the running port was concatenated with 9000 in case of
bin/stop.sh. Ex:
http://localhost:9000 # bin/start.sh, actual port 8888
http://localhost:90008888 # bin/installation.sh, actual port 8888

* fixes loklak#1177 - Added tests for WordpressCrawlerService.java

fixes issue loklak#1177. Added tests for WordpressCrawlerService.java and
also removed the leading 'Author' from the author field in json
output.

* fix loklak#1176: Fetch debug flag from config file

Change configurations for TwitterScraper and ClientConnection

* fixes loklak#1184 - Instagram Profile Scraper is now working

fixes issue loklak#1184. Instagram scraper is now returning data.

* fix loklak#1179: Use java.net.URL to build relative URL in ClientConnection (loklak#1183)

* fixes loklak#1070: Add test for URL unshortening (loklak#1173)

* fixes loklak#1169 - Added test for Github profile scraper (loklak#1185)

fixes issue loklak#1169, Added tests for GithubProfileScraper service.

* Improve code quality for some files in org.loklak.api.cms and add checkstyle as gradle task (loklak#1187)

* Related loklak#1070: Improve code quality for some files in org.loklak.api.cms

Fixes are done using checkstyle with google_check.xml config and 4 space indentation level

* Add checkstyle check as gradle task

* Fixes loklak#1191: NullPointerException in CareTaker.java (loklak#1192)

* Auto-generate docs in dev.loklak.org repository (loklak#1195)

* Fix loklak#1171: Extract video URLs from IFrame (loklak#1193)

Videos are added as an IFrame for Twitter. To fetch the video URLs, we first fetch the IFrame page and then check for the video format. If it is mp4, we're done. If it is m3u8, we need to fetch the m3u8 link in order to get actual videos. Mostly, these videos are of .ts format.

Also add org.unbescape as gradle dependency to unescape string in iframe.

* Fix loklak#1201: Break down KaizenHarvester into simpler pieces (loklak#1203)

Introduce KaizenQuery class to support different methods to store queries that Kaizen needs to process

* Fix loklak#1208: Add .editorconfig (loklak#1209)

* Fixes loklak#1204 Add subtree if not already added (loklak#1207)

* Fix loklak#1205: Extract complete video URLs for Tweets (loklak#1206)

This implementation mimics the video playback flow of Twitter's mobile React app.
    1. Extract the URL of the script that holds BEARER_TOKEN.
    2. Extract the guest session token.
    3. Extract BEARER_TOKEN from the script at the URL found in step 1.
    4. Make the Twitter API call with these parameters.

* fixes loklak#1196 - Enhanced Quora profile scraper loklak#1199 (loklak#1200)

Fixes issue loklak#1196. The scraper now provides more information, such as the
user's university, the location where the user works, topics the user knows,
number of followers, number of questions, number of edits, number of blogs, etc.

* Fix loklak#1188: Use unbescape to unescape HTML in html2utf8 (loklak#1194)

Also improve whitespace cleaning in the method. Move old implementation to html2utf8Custom.

* Fixes loklak#1097: Restore access specifiers in TwitterScraper.java (loklak#1198)

* Fix indentation (loklak#1211)

* Fix loklak#1212: fix checkstyle errors(except missing javadoc) (loklak#1218)

* Fixes loklak#1215 fix syntax error in the script (loklak#1217)

* Fix loklak#1213: Include videos for testing TwitterScraper (loklak#1221)

* Fix 1216: Revert "Installation and Start on a user specified port (loklak#1159)" (loklak#1227)

This reverts commit 1e0bcd5.

Conflicts (resolved):
	bin/installation.sh
	bin/start.sh

* Fixes loklak#1202: Modify loggers in Loklak Server for testing (loklak#1222)

* Fixes loklak#1219: Add UTC time in TimeAndDateService (loklak#1220)

* Fixes loklak#1112: Add image, video filter constraints for cache (loklak#1190)

* Fixes loklak#1236: Update Docs for get parameter (loklak#1237)

* Fixes loklak#1226 Build error currently showing (loklak#1228)

* Fixes 1215 Fix relative link

* Update git to work with subtree

* Adding echo statements

* Fix loklak#1239: Correct flag values in config.properties

* Fix loklak#1238: Add PriorityQueue harvesting strategy (loklak#1240)

Also add score related to each Tweet based on retweet and favourite count.

* Fix loklak#1251: Correct test case for RedirectUnshortener (loklak#1253)

http://t.co/E3w7s2qdBT now points to http://www.mostviralfeed.com/what-lady-gaga-actually-looks-like instead of http://mostviralfeed.com/what-lady-gaga-actually-looks-like

* Fix loklak#1247: Add function to collect stats about all classes for a classifier (loklak#1248)

* Fix loklak#1256: Add classifier.json endpoint to serve aggregated data (loklak#1257)

* refactoring to have the same naming as in susi_server

* Fixes loklak#1261: RedirectUnshortener link fix (loklak#1262)

* Fixes loklak#1229, loklak#1235, Related loklak#1230: Setup of testable version (loklak#1250)

1) setup post and basescraper

2) Setup quoraprofilescraper with basescraper and post

* Fix loklak#1259: Add function for time sensitive aggregation (loklak#1260)

* Fix loklak#1271: Correct redirect link in test (loklak#1272)

* Fix loklak#1266: Allow time based aggregation in /api/classifier.json (loklak#1267)

* Fix loklak#1278: Correct typo in kaizen.md (loklak#1279)

* enhanced elasticsearch mapping

* eclipse classpath to use same as gradle

* removed unused imports

* Fix loklak#1268: Add function for aggregation based on country codes (loklak#1270)

The following operations are now possible:
* All time aggregation for all countries
* Time sensitive aggregation for all countries
* All previous aggregations for selected countries

* Fix loklak#1273: Add Jacoco to provide coverage report in XML format (loklak#1274)

* Fixes 1284: Improve test cases for URL unshortener (loklak#1285)

* Setup post and basescraper with QuoraProfileScraper (loklak#1249)

* Setup of testable version

setup post and basescraper

* Related loklak#1230, 1231, 1244: integrate Timeline2 with quorascraper

* Configure ssh agent before push

vibhcool added a commit to vibhcool/loklak_server that referenced this pull request Jul 6, 2017

Update Master with Development branch (loklak#1281)
* deploy button info for docker loklak#1001

* Fixes loklak#1045 : Replace the image logo in navigation bar with a text

Fixes loklak#1045 : Replace the image logo in navigation bar with a text

* Fixes loklak#1048: fix execution of method without query string

* fix for latest twitter html change

* add docker status badge

This is related to issue loklak#1049

* update documentation path

Problem: The documentation has moved. The links in the README are outdated
Solution: insert the containing folder into the path of all links to docs

* moved dockerfile to root folder loklak#1049

* changed travis build to new location

* updated Dockerfile path for compose

This is for issue loklak#1049.
The pull request loklak#1050 is a precondition for this to make sense.

Problem: the Dockerfile was moved to /
Solution: adapt the path

* get aggregations also with fresh requests from twitter with source=all

* Fixes loklak#1060: Increase default Xmx value

* Fixed loklak#1059 - Remove and Ignore .DS_Store

* Fixes loklak#1067 - Tweet URL in README is broken

* corrected heading

"Where do I find the java?" ->"Where do I find the Java documentation?"

* Using the note directive of sphinx

See loklak#1042

* README.md upd, useful links added

* Fixes loklak#1033, loklak_server README.md upd, links updated with link syntax

* Move documentation site

The documentation site is now moved to https://github.com/loklak/dev.loklak.org

Closes loklak#1014

* fix username emoji in tweet

* Fix unused imports in python files(codacy issue)

Related to loklak#1070

* removed .DS_Store

* added .DS_Store to gitignore

* Fix use of Null in scala code

Related to loklak#1070

* fixed scraper

* Edited Readme

* Add update trigger script for docs

Closes loklak#1003

* Creating Volume for persistence while deploying via docker, fix loklak#1051 (loklak#1089)

* Update Dockerfile

* Update Dockerfile

* Update docker-compose.yml

* Update docker-compose.yml

* Update docker-compose.yml

* Update Dockerfile

* Update Dockerfile

* Update Dockerfile

* Update Dockerfile

* Update docker-compose.yml

* Update docker-compose.yml

* Update docker-compose.yml

* Update docker-compose.yml

* updated docker build badge

I changed the url so github requests a new image.
The build works.
https://hub.docker.com/r/mariobehling/loklak/builds/

* Docker: Consistent Volume Path

Problem: docker-compose volume path is not the same as the dockerfile volume path
Solution: Set the docker-compose volume path to the dockerfile volume path

You can view the correct path in the Dockerfile:
https://github.com/loklak/loklak_server/blob/7a1f0378dc40ec25eec6083e43558a62408d84e8/Dockerfile#L38
I checked in the container:
```
bash-4.3# ls /loklak_server/
bin              conf             gradlew          settings.gradle
build            data             html             src
build.gradle     gradle           installation     ssi
bash-4.3# ls /
bin            lib            proc           srv            var
dev            loklak_server  root           sys
etc            media          run            tmp
home           mnt            sbin           usr
```
the data directory exists and is filled within `/loklak_server`

* .travis.yml: Add keys for dev.loklak.org

Closes loklak#1091

* fix initGet

* option to autodelete messages after one month from the main index

* disabling feature introduced with
27272ee
for issue loklak#919

The storage of the settings file caused that the settings file was
broken. It blew up to a huge file, like
$ ls -l customized_config.properties
-rw-r--r-- 1 loklak loklak 251650030 Apr 10 19:08
customized_config.properties

This is the main cause that loklak.org was down since this feature was
introduced.

* Fixes loklak#1099 : Changes the href link of the button download, install and extend

* fix loklak#1056 - document how to start contributing (loklak#1063)

* Added JS EventListener to resize dump iframe on load. Closes loklak#1101

* Add Unit Tests to Loklak Server (loklak#1098)

* Add unit tests for TwitterScraper.java

* Add data file to test JSONRandomAccessFileTest.java

* set up unit tests build in loklak Server

* fix changes requested and codacy issues

* fixes scrollbar event

* at the twitter scraper now use more readable version of assert, also fix bug with parse long in youtube scraper(fails on Long.parse method, because spaces are not removed), add unit test for youtube scrapper.

* fix bug with youtube scrapper and add unit test for scraper

* Fixes loklak#1103: Changed the URLs to the correct ones (loklak#1104)

* Fixes loklak#1103: Changed the URLs to the correct ones

* Fixes loklak#1108: Fixed the typos in documentation

* fix and modify the GithubProfileScraper.java

* fixes loklak#961: add query in KaizenHarverster's queue to get older Tweets

In case if the current timeline's query already has an until statement, replace it's date part with the oldest one. Also add DateFormat object in KaizenHarverster to parse Date into String of format yyyy-MM-dd.

* fix eclipse classpath for storing classes (loklak#1097)

* Fixes loklak#1123: Adding Gemnasium Button & Fixing Docker build button

* Fix Codacy issue in Timeline.java. Related loklak#1070

Link to codacy: https://www.codacy.com/app/sudheesh1995/loklak_server/file/6470204147/issues/source?bid=3495500&fileBranchId=3495500
Description: Fields should be declared at the top of the class

* Fix Codacy issue for some files in org.loklak.server.api. Related loklak#1070

* ConsoleService.java
  - Fields should be declared at the top of the class
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6484902617/issues/source?bid=3495500&fileBranchId=3495500

* EventBriteCrawler.java
  - Make spacing consistent for conditionals

* GraphServlet.java
  - Reduce complexity of doGet method
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6484903642/issues/source?bid=3495500&fileBranchId=3495500

* Rename Dockerfile-learnings.md to docs/Dockerfile-learnings.md

* fix loklak#1138: Correct spelling mistake in README.md (loklak#1140)

Change "descripe" to "describe" in How to Contribute section.

* Fixes loklak#1123: Adding Gemnasium Button & Fixing Docker build button in rst file (loklak#1137)

* Related loklak#1070: Fix Codacy issues for files in org.loklak.api.search (loklak#1134)

* EventBriteCrawlerService.java
  - Use one line for each declaration
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733425/issues/source?bid=3495500&fileBranchId=3495500

* GenericScraper.java
  - Indentation fix
  - New line before EOF

* GithubProfileScraper.java
  - Remove trailing whitespaces

* MeetupsCrawlerService.java
  - Use one line for each declaration
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733676/issues/source?bid=3495500&fileBranchId=3495500

* SearchServlet.java
  - Indentation fix

* SuggestServlet.java
  - Position literals first in String comparisons
  - Fields should be declared at the top of the class
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733640/issues/source?bid=3495500&fileBranchId=3495500

* WeiboUserInfo.java
  - Switch statements should have a default label
  - https://www.codacy.com/app/sudheesh1995/loklak_server/file/6497733550/issues/source?bid=3495500&fileBranchId=3495500

* Fixes loklak#1139: Changed the URL (loklak#1141)

* Fixes loklak#1139: Changed the URL

* Fixes loklak#1139: Changed the URL

* Fix "Strings must use doublequote. (quotes)"
Related to loklak#1070

* Fix loklak#1070: Strings must use doublequote. (quotes), no-use-before-define

* Fix loklak#1070: Strings must use doublequote. (quotes), no-use-before-define

* Fixes loklak#1070:Strings must use double quotes, no-use-before-define

* Related to loklak#1070:Strings must use double quotes, no-use-before-define

* Related loklak#1058: Add Kaizen harvester usage documentation (loklak#1145)

* Fix loklak#1070: Strings must use doublequote. (quotes), no-use-before-define (loklak#1121)

* Fix "Strings must use doublequote. (quotes)"
Related to loklak#1070

* Fix loklak#1070: Strings must use doublequote. (quotes), no-use-before-define

* Fix loklak#1070: Strings must use doublequote. (quotes), no-use-before-define

* fix loklak#1130: Make retries and back off parameter for backend push configurable (loklak#1131)

These variables can be set from config.properties by changing/defining caretaker.backendpush.retries and caretaker.backendpush.backoff respectively.

* Fixes part of loklak#1132: Add unit test to check TwitterScraper output (loklak#1133)

* convert markdown file to rst (loklak#1142)

* Merged development fixed conflict.

* Improve code quality for org.loklak.geo.*

* Related loklak#1070: Improve code quality for org.loklak.api.admin.* (loklak#1149)

* Related to loklak#1070: Improve code quality for org.loklak.Crawler.java

* fix related to loklak#1152: code refractoring for logging (loklak#1153)

* fix related to loklak#1133: fix access specifiers (loklak#1151)

* fixes loklak#1161: Add GCloud Kubernetes deployment document for loklak (loklak#1162)

* fixes loklak#1146: Check for TwitterFactory before getting instance (loklak#1147)

* Related loklak#1070: Fix Codacy issues for org.loklak.api.amazon.* (loklak#1163)

* fix loklak#1143: Fix NumberException in YoutubeScraper (loklak#1157)

* Installation and Start on a user specified port (loklak#1159)

Solves issue: loklak#925

* Fixes loklak#1165: Fixed the QuoraProfileScraper and displaying profileImage

* Related to loklak#1112:Add filter for images, videos (loklak#1164)

* Related loklak#1156: Make harvesting decision biased for Kaizen (loklak#1158)

A probability is chosen as queuries.size() / QUERIES_LIMIT, which is compared to a randomly chosen target probability and decision is taken accordingly. In case of no limit on the queue size, probability to harvest is set to 0.5.

* Fixes loklak#1167 GithubScraperService able to scrape user specific data (loklak#1168)

Fixes issue loklak#1167.githubprofilescraper service now displays starred_url,
number of starred repos,followers_url, number of followers, following_url,
number of people following for a particuler user.

* fixes loklak#1114 Improve URL shortening service

* Include all 30X HTTP response code while checking for redirect.
* Use POST requests as fallback for GET requests - There are many cases (mostly https?://fb.me/*) when GET requests give status 400: Bad Request, while POST request works fine. The patch will allow to make an attemt for POST request for such cases and fetch the result.
* Try to fetch URL from <meta/> tag in response body in case of non redirect status code.
* Check the validity of URL shortening only once, and not for each intermediate URL.

* Displays proper url to open loklak_server

Solves issue: loklak#1172

Displays the proper localhost URL on which loklak_server is running after
the execution of bin/start.sh or bin/installation.sh with a "p" flag.

Earlier, the displayed localhost URL always ended in port 9000 in the case of
bin/start.sh, and the running port was appended to 9000 in the case of
bin/installation.sh. Ex:
http://localhost:9000 # bin/start.sh, actual port 8888
http://localhost:90008888 # bin/installation.sh, actual port 8888

* fixes loklak#1177 - Added tests for WordpressCrawlerService.java

Fixes issue loklak#1177. Added tests for WordpressCrawlerService.java and
also removed the leading 'Author' from the author field in the JSON
output.

* fix loklak#1176: Fetch debug flag from config file

Change configurations for TwitterScraper and ClientConnection

* fixes loklak#1184 - Instagram Profile Scraper is now working

fixes issue loklak#1184. Instagram scraper is now returning data.

* fix loklak#1179: Use java.net.URL to build relative URL in ClientConnection (loklak#1183)
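
For illustration, resolving a relative redirect target against the URL that produced it with java.net.URL can be sketched as below. The class and method names are hypothetical and this is not the ClientConnection code; only the two-argument URL constructor is taken from the JDK.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper: resolve a possibly relative location against a base URL. */
class RelativeUrlSketch {
    static String resolve(String base, String location) {
        try {
            // URL(URL context, String spec) resolves spec relative to context;
            // an absolute spec replaces the context entirely.
            return new URL(new URL(base), location).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve " + location, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://example.com/a/b", "/c"));
    }
}
```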

* fixes loklak#1070: Add test for URL unshortening (loklak#1173)

* fixes loklak#1169 - Added test for Github profile scraper (loklak#1185)

fixes issue loklak#1169, Added tests for GithubProfileScraper service.

* Improve code quality for some files in org.loklak.api.cms and add checkstyle as gradle task (loklak#1187)

* Related loklak#1070: Improve code quality for some files in org.loklak.api.cms

Fixes are done using checkstyle with google_check.xml config and 4 space indentation level

* Add checkstyle check as gradle task

* Fixes loklak#1191: NullPointerException in CareTaker.java (loklak#1192)

* Auto-generate docs in dev.loklak.org repository (loklak#1195)

* Fix loklak#1171: Extract video URLs from IFrame (loklak#1193)

Videos are added as an IFrame for Twitter. To fetch the video URLs, we first fetch the IFrame page and then check for the video format. If it is mp4, we're done. If it is m3u8, we need to fetch the m3u8 link in order to get actual videos. Mostly, these videos are of .ts format.

Also add org.unbescape as gradle dependency to unescape string in iframe.
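
A rough sketch of the mp4 / m3u8 branching described above, assuming the video URL has already been pulled out of the IFrame page. All names are hypothetical, the `fetch` function stands in for whatever HTTP client is used, and only one playlist level is followed (a master playlist that points at further m3u8 files would need another pass).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch: turn an IFrame video URL into playable media URLs. */
class VideoUrlSketch {
    /** Resolve every non-tag line of an m3u8 playlist against the playlist URL. */
    static List<String> playlistEntries(String playlistUrl, String playlistBody) {
        URI base = URI.create(playlistUrl);
        List<String> entries = new ArrayList<>();
        for (String line : playlistBody.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // playlist tags and comments start with '#'
            }
            entries.add(base.resolve(trimmed).toString());
        }
        return entries;
    }

    /** mp4 URLs are returned as-is; m3u8 URLs are fetched and expanded. */
    static List<String> videoUrls(String videoUrl, Function<String, String> fetch) {
        if (videoUrl.endsWith(".mp4")) {
            return Collections.singletonList(videoUrl);
        }
        if (videoUrl.endsWith(".m3u8")) {
            return playlistEntries(videoUrl, fetch.apply(videoUrl));
        }
        return Collections.emptyList(); // unknown format: nothing to report
    }

    public static void main(String[] args) {
        String body = "#EXTM3U\n#EXTINF:3.0,\nseg0.ts\n";
        System.out.println(videoUrls("https://video.example.com/pl/abc.m3u8", url -> body));
    }
}
```

The suffix checks ignore query strings, which is a simplification for the sketch.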

* Fix loklak#1201: Break down KaizenHarvester into simpler pieces (loklak#1203)

Introduce KaizenQuery class to support different methods to store queries that Kaizen needs to process

* Fix loklak#1208: Add .editorconfig (loklak#1209)

* Fixes loklak#1204 Add subtree if not already added (loklak#1207)

* Fix loklak#1205: Extract complete video URLs for Tweets (loklak#1206)

This implementation mimics the video playback flow of Twitter's mobile React app.
    1. Extract the URL of the script that holds BEARER_TOKEN.
    2. Extract the guest session token.
    3. Extract BEARER_TOKEN from the script at the URL from step 1.
    4. Make the Twitter API call with these parameters.

* fixes loklak#1196 - Enhanced Quora profile scraper loklak#1199 (loklak#1200)

Fixes issue loklak#1196. The scraper now provides more information, such as the
user's university, the location where the user works, topics the user knows,
number of followers, number of questions, number of edits, number of blogs, etc.

* Fix loklak#1188: Use unbescape to unescape HTML in html2utf8 (loklak#1194)

Also improve whitespace cleaning in the method. Move old implementation to html2utf8Custom.

* Fixes loklak#1097: Restore access specifiers in TwitterScraper.java (loklak#1198)

* Fix indentation (loklak#1211)

* Fix loklak#1212: fix checkstyle errors(except missing javadoc) (loklak#1218)

* Fixes loklak#1215 fix syntax error in the script (loklak#1217)

* Fix loklak#1213: Include videos for testing TwitterScraper (loklak#1221)

* Fix 1216: Revert "Installation and Start on a user specified port (loklak#1159)" (loklak#1227)

This reverts commit 1e0bcd5.

Conflicts (resolved):
	bin/installation.sh
	bin/start.sh

* Fixes loklak#1202: Modify loggers in Loklak Server for testing (loklak#1222)

* Fixes loklak#1219: Add UTC time in TimeAndDateService (loklak#1220)

* Fixes loklak#1112: Add image, video filter constraints for cache (loklak#1190)

* Fixes loklak#1236: Update Docs for get parameter (loklak#1237)

* Fixes loklak#1226 Build error currently showing (loklak#1228)

* Fixes 1215 Fix relative link

* Update git to work with subtree

* Adding echo statements

* Fix loklak#1239: Correct flag values in config.properties

* Fix loklak#1238: Add PriorityQueue harvesting strategy (loklak#1240)

Also add a score to each Tweet based on its retweet and favourite counts.
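
One way a score-ordered queue could be put together with java.util.PriorityQueue is sketched below. The commit message does not give the scoring formula or say how the queue consumes the score, so the plain sum and the "highest score first" ordering are assumptions, and every name here is hypothetical.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical sketch of a score-ordered query queue. */
class ScoredQueueSketch {
    static final class ScoredQuery {
        final String query;
        final long score;

        ScoredQuery(String query, long score) {
            this.query = query;
            this.score = score;
        }
    }

    /** Assumed formula: plain sum of the retweet and favourite counts. */
    static long score(long retweetCount, long favouriteCount) {
        return retweetCount + favouriteCount;
    }

    /** Queue whose head is always the entry with the highest score. */
    static PriorityQueue<ScoredQuery> newQueue() {
        return new PriorityQueue<>(
            Comparator.comparingLong((ScoredQuery q) -> q.score).reversed());
    }

    public static void main(String[] args) {
        PriorityQueue<ScoredQuery> queue = newQueue();
        queue.add(new ScoredQuery("fossasia", score(3, 4)));
        queue.add(new ScoredQuery("loklak", score(40, 2)));
        System.out.println(queue.poll().query);
    }
}
```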

* Fix loklak#1251: Correct test case for RedirectUnshortener (loklak#1253)

http://t.co/E3w7s2qdBT now points to http://www.mostviralfeed.com/what-lady-gaga-actually-looks-like instead of http://mostviralfeed.com/what-lady-gaga-actually-looks-like

* Fix loklak#1247: Add function to collect stats about all classes for a classifier (loklak#1248)

* Fix loklak#1256: Add classifier.json endpoint to serve aggregated data (loklak#1257)

* refactoring to have the same naming as in susi_server

* Fixes loklak#1261: RedirectUnshortener link fix (loklak#1262)

* Fixes loklak#1229, loklak#1235, Related loklak#1230: Setup of testable version (loklak#1250)

1) setup post and basescraper

2) Setup quoraprofilescraper with basescraper and post

* Fix loklak#1259: Add function for time sensitive aggregation (loklak#1260)

* Fix loklak#1271: Correct redirect link in test (loklak#1272)

* Fix loklak#1266: Allow time based aggregation in /api/classifier.json (loklak#1267)

* Fix loklak#1278: Correct typo in kaizen.md (loklak#1279)

* enhanced elasticsearch mapping

* eclipse classpath to use same as gradle

* removed unused imports

* Fix loklak#1268: Add function for aggregation based on country codes (loklak#1270)

The following operations are now possible:
* All time aggregation for all countries
* Time sensitive aggregation for all countries
* All previous aggregations for selected countries

* Fix loklak#1273: Add Jacoco to provide coverage report in XML format (loklak#1274)

* Fixes 1284: Improve test cases for URL unshortener (loklak#1285)

* Setup post and basescraper with QuoraProfileScraper (loklak#1249)

* Setup of testable version

setup post and basescraper

* Related loklak#1230, 1231, 1244: integrate Timeline2 with quorascraper

* Configure ssh agent before push

vibhcool added a commit to vibhcool/loklak_server that referenced this pull request Jul 15, 2017

Fix loklak#1171: Extract video URLs from IFrame (loklak#1193)
Videos are added as an IFrame for Twitter. To fetch the video URLs, we first fetch the IFrame page and then check for the video format. If it is mp4, we're done. If it is m3u8, we need to fetch the m3u8 link in order to get actual videos. Mostly, these videos are of .ts format.

Also add org.unbescape as gradle dependency to unescape string in iframe.