
Gazouilloire

A Twitter stream and search API grabber with various configuration options, such as collecting only during specific time periods or limiting the collection to specific locations.

HowTo

  • Install dependencies:
    sudo apt-get install mongodb-10gen
    pip install -r requirements.txt
  • Copy config.json.example to config.json

  • Set your Twitter API key and generate the related Access Token

"twitter": {
   "key": "<Consumer Key (API Key)>xxxxxxxxxxxxxxxxxxxxx",
   "secret": "<Consumer Secret (API Secret)>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
   "oauth_token": "<Access Token>xxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
   "oauth_secret": "<Access Token Secret>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}
  • List the desired keywords and @users, and/or the desired url_pieces, as JSON arrays:

      "keywords": [
          "amour",
          "\"mots successifs\"",
          "@medialab_scpo"
      ],
      "url_pieces": [
          "medialab.sciencespo.fr/fr"
      ],

    Some advanced filters can be used in combination with the keywords, such as -undesiredkeyword, filter:links, -filter:media, -filter:retweets, etc. See Twitter API's documentation for more details.

    Avoid accented characters in keywords: Twitter automatically returns tweets both with and without accents. For instance, searching for "heros" finds tweets containing either "heros" or "héros".
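    To prepare a keyword list along those lines, accents can be stripped before the keywords are written to config.json. The snippet below is a minimal sketch using only Python's standard library; the `strip_accents` helper is a hypothetical name for illustration and is not part of Gazouilloire.

```python
import unicodedata


def strip_accents(keyword: str) -> str:
    """Return the keyword without combining accent marks (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", keyword)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


keywords = ["amour", "héros", "@medialab_scpo"]
normalized = [strip_accents(k) for k in keywords]
print(normalized)
```

    Keywords without accents pass through unchanged, so the helper can be applied to the whole list.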

    Three further filters are available:

    • language: to collect only tweets written in a specific language, add "language": "fr" to the config (the language must be given as an ISO 639-1 code)
    • geolocalisation: add a "geolocalisation": "Paris, France" field to the config with the desired geographical boundaries, or give the coordinates of the desired bounding box as shown in the config example file
    • time_limited_keywords: to filter on specific keywords during planned time periods:
      "time_limited_keywords": {
          "#m6": [
              ["2014-05-01 16:00", "2014-05-01 16:05"],
              ["2014-05-08 16:00", "2014-05-08 16:05"],
              ["2014-05-15 16:00", "2014-05-15 16:05"],
              ["2014-05-22 16:00", "2014-05-22 16:05"]
          ],
          "bieber": [
              ["2014-05-08 16:00", "2014-05-08 16:05"]
          ]
      },
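    To illustrate how such time windows can be read, here is a minimal sketch that reports which time-limited keywords are active at a given moment. It is not Gazouilloire's own implementation: the `active_keywords` function is hypothetical, and it assumes inclusive window bounds and naive datetimes (the timezone option is ignored).

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"


def active_keywords(time_limited_keywords, now):
    """Return the sorted keywords with at least one [start, end] window containing now."""
    active = []
    for keyword, windows in time_limited_keywords.items():
        for start, end in windows:
            begin = datetime.strptime(start, TIME_FORMAT)
            finish = datetime.strptime(end, TIME_FORMAT)
            # Assumption: both bounds are inclusive.
            if begin <= now <= finish:
                active.append(keyword)
                break
    return sorted(active)


config = {
    "time_limited_keywords": {
        "#m6": [
            ["2014-05-01 16:00", "2014-05-01 16:05"],
            ["2014-05-08 16:00", "2014-05-08 16:05"],
        ],
        "bieber": [["2014-05-08 16:00", "2014-05-08 16:05"]],
    }
}

windows = config["time_limited_keywords"]
print(active_keywords(windows, datetime(2014, 5, 1, 16, 2)))  # ['#m6']
print(active_keywords(windows, datetime(2014, 5, 8, 16, 2)))  # ['#m6', 'bieber']
```

    A window whose end precedes its start never matches under this sketch, so it is worth checking each pair when writing the config.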
  • Set up extra options:

    • resolve_redirected_links: set to true or false to enable or disable automatic resolution of all links found in tweets (t.co links are always resolved; this option also resolves links from other shorteners such as bit.ly).

    • grab_conversations: set to true to activate automatic iterative collection of all tweets that the collected tweets reply to (warning: account for these tweets when processing the data, since this often results in collecting tweets from well outside the collection time period).

    • catchup_past_week: Twitter's free API only allows collecting tweets up to 7 days in the past, which Gazouilloire does by default. Set this option to false to disable this and only collect tweets posted after the collection was started.

    • download_medias: set to true to activate automatic downloading of all media (images and videos) posted by users within their tweets (this does not include images from social cards). Also set the medias_directory field to the absolute path where Gazouilloire should store the images and videos on the machine.

    • timezone: adjust the timezone in which tweet timestamps should be computed. If an invalid value is set, Gazouilloire lists the allowed values at startup.
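    Before starting a collection, the options above can be sanity-checked with a few lines of Python. This is only a sketch under stated assumptions: `check_extra_options` is a hypothetical helper, not part of Gazouilloire, and it assumes these options sit at the top level of the parsed config.json.

```python
import os

# Options documented above as taking true or false.
BOOLEAN_OPTIONS = (
    "resolve_redirected_links",
    "grab_conversations",
    "catchup_past_week",
    "download_medias",
)


def check_extra_options(config):
    """Return a list of human-readable problems found in the extra options."""
    problems = []
    for name in BOOLEAN_OPTIONS:
        if name in config and not isinstance(config[name], bool):
            problems.append(f"{name} should be true or false")
    # medias_directory is only required when media downloading is enabled.
    if config.get("download_medias") is True:
        directory = config.get("medias_directory")
        if not isinstance(directory, str) or not os.path.isabs(directory):
            problems.append("medias_directory should be an absolute path")
    return problems


print(check_extra_options({"download_medias": True, "medias_directory": "medias"}))
print(check_extra_options({"grab_conversations": False, "catchup_past_week": True}))
```

    The first call reports the relative medias_directory; the second returns an empty list because nothing is wrong.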

  • Run with:

    ./restart.sh
    # or
    ./gazouilloire/run.py
  • Data is stored in your MongoDB instance. You can also export it easily with simple scripts such as those in the bin directory:
# To export a csv with most fields (formatted similarly to [DMI's TCAT](https://github.com/digitalmethodsinitiative/dmi-tcat)):
bin/export_csv_as_tcat.py
# To export a csv of all tweets having a specific word in their text:
bin/export_csv_as_tcat.py medialab
# To export a csv of all tweets having one of many specific words in their text:
bin/export_csv_as_tcat.py medialab digitalhumanities datajournalism '#python'
# To export a csv of all tweets matching a specific MongoDB query, for instance by user screen name:
bin/export_csv_as_tcat.py "{'user_screen_name': 'medialab_ScPo'}"
# To export a csv with the most useful fields:
bin/export_csv.py
# To export the whole text content of the tweets:
bin/export_all_text.py

Publications using Gazouilloire

Publications talking about Gazouilloire

Credits & License

Benjamin Ooghe-Tabanou @ Sciences Po médialab

Discover more of our projects at médialab tools.

This work is supported by DIME-Web, part of DIME-SHS research equipment financed by the EQUIPEX program (ANR-10-EQPX-19-01).

Gazouilloire is free open source software released under the GPL 3.0 license.
