martineriksson/import_facebook_into_discourse

 
 


Import Facebook groups to Discourse

This rake task exports data from Facebook groups and imports it into Discourse, including posts, comments, likes, images, user accounts and user tags.

Instructions for usage

  • Add this to your Gemfile:
gem 'koala'
gem 'stringex'
gem 'json'
gem 'unicode_utils'
  • Run bundle install

  • Get an access token for the Facebook Graph API (see section below for details)

  • Edit config/import_facebook.yml

  • Place config/import_facebook.yml in your config folder

  • Place lib/tasks/import_facebook.rake in your lib/tasks folder

  • Depending on your environment you might need to prepend the next command with export RAILS_DB=<your database> (multisite) or e.g. ENV=<production/development/etc>

  • Run bundle exec rake import:facebook_group

Options

In config/import_facebook.yml there are a number of configuration options:

  • facebook_group_id is the Facebook ID of the group you want to export posts from. You need to use an access token (see below) generated by someone who is an admin of this group.

  • discourse_category_name is the name of the category to which topics will be imported. It will be created if it does not already exist.

  • discourse_admin is the Discourse username of a user with admin privileges. This account will be used to create users and the category.

  • test_mode enables a test import which produces log output similar to that of an actual import but does not write anything to the Discourse database. If store_data_to_files (see below) is set to true, a test run will save all fetched data to disk. It is highly recommended to first run in test mode and store to disk, then run without test mode to import from disk rather than fetching everything in real time.

  • store_data_to_files enables saving all fetched data (i.e. every response from the API) to disk in the directory facebook-data. If a response is already saved it will be loaded from disk instead of fetched from the API. This is useful for importing large groups and generates a handy archive of all exported data. One use for this is to export data to one machine (e.g. development) and then transfer the files to import on another machine (e.g. production) without downloading everything an additional time.

  • api_call_delay is the number of seconds to wait before each API call, to avoid API rate limiting. Setting this value to 1 should all but guarantee that you do not exceed the limit. A delay is usually not needed when importing very small groups, nor when importing groups with large amounts of activity per post. Begin by leaving the value at 0 and increase it if you hit the rate limit.

  • restart_from_topic_number is used for skipping posts which have already been imported. The number is an index into the array of posts collected during the initial fetch of top-level posts. When the script ends it will tell you which number to restart from.

  • import_oldest_first if set to true will reverse the importing order, leading to the oldest posts being imported first (more specifically: the posts which have not been updated in the longest time).

  • real_email_addresses if set to false will append ".fake" to all imported email addresses. (Note: Being able to get email addresses from the Facebook API seems to be rare, so count on having to collect and add them in some other way.)
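To make the options above concrete, here is a minimal Ruby sketch of a config file and of how the real_email_addresses option behaves. The option names are taken from the list above. The flat YAML layout, the example values and the access_token key name are assumptions for illustration and may not match the actual import_facebook.yml shipped with the importer.

```ruby
require 'yaml'

# Hypothetical contents of config/import_facebook.yml. Only the option names
# come from the README; layout, values and the access_token key are assumed.
example_config = <<~YAML
  access_token: "PASTE_TOKEN_HERE"
  facebook_group_id: "1234567890"
  discourse_category_name: "Facebook archive"
  discourse_admin: "system"
  test_mode: true
  store_data_to_files: true
  api_call_delay: 0
  restart_from_topic_number: 0
  import_oldest_first: true
  real_email_addresses: false
YAML

config = YAML.safe_load(example_config)

# Mirrors the documented behaviour of real_email_addresses: append ".fake"
# to every address unless real addresses were requested.
def import_email(address, config)
  config['real_email_addresses'] ? address : "#{address}.fake"
end

puts import_email('someone@example.com', config)
```

With test_mode and store_data_to_files both set to true, as in this example, a first run would only download and archive data, which matches the recommended workflow described above.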

Access Tokens for the Facebook Graph API

Any kind of import will require an access token to communicate with the Facebook Graph API. In all cases you will need to have Admin privileges in the group you are importing, and in all cases you will enter your access token in the config file, i.e. import_facebook.yml.

There are two basic options for generating access tokens, each with specific benefits and drawbacks:

1. User Access Token

This is the simplest kind to generate and the one which gives the best kind of data from the API (see below). Recommended for importing small groups.

Drawback: Each access token is only valid for 1-2 hours, so to import a large group the script will have to be restarted many times with new access tokens. The script will exit cleanly when an access token expires and give you instructions on how to restart the import from the last post imported before exiting. Nevertheless, this requires a manual step every couple of hours or so. If you have a group with lots of activity and long threads, you might be able to import fewer than 100 threads per access token, so a group with 5,000 threads might require you to restart the script with a new access token 50 times over the course of several days.
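The restart behaviour described above can be pictured with a short Ruby sketch. This is not the importer's actual code: the TokenExpired class and the import_posts method are hypothetical names, and the sketch assumes the reported number is the index of the first post that still needs to be imported.

```ruby
# Hypothetical error class standing in for whatever the API client raises
# when an access token has expired.
class TokenExpired < StandardError; end

# Processes posts starting at index `restart_from`. Returns the index to
# restart from if the token expires, or nil when every post was processed.
def import_posts(posts, restart_from)
  posts.each_with_index do |post, index|
    next if index < restart_from # already imported in an earlier run

    begin
      yield post
    rescue TokenExpired
      return index
    end
  end
  nil
end

# Demo: skip the first post, then simulate the token expiring at post p3.
imported = []
stopped_at = import_posts(%w[p0 p1 p2 p3], 1) do |post|
  raise TokenExpired if post == 'p3'

  imported << post
end

puts "Set restart_from_topic_number to #{stopped_at}" if stopped_at
```

In the worst case mentioned above, 5,000 threads at roughly 100 threads per token means going through this exit-and-restart cycle about 50 times.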

To generate a User Access Token, go to the Graph API Explorer:

https://developers.facebook.com/tools/explorer/

  • Click the Get Token button and then Get User Access Token

  • Make sure that the permission user_managed_groups is selected

  • Copy the access token generated into config/import_facebook.yml

  • If the token expires, reload the page to generate a new one

2. App Access Token

This access token is slightly more complicated to generate but has the benefit of being valid for about 60 days, which should be enough to import even very large groups in one single session.

Drawback: Objects returned by the API will in some cases have special IDs which are only valid for the app for which the token was created. This means that if for some reason you cannot generate a new access token from the same app and need to run the import again, bad things will happen. For example, user accounts which have already been imported will not be recognized and duplicates will be created instead. Additionally, you will probably need to violate the Facebook Developers terms of service to use this method, so Facebook could suspend your app, leading to the above-mentioned issues when restarting.

Here is how to generate an App Access Token:

  • Create a Facebook App at https://developers.facebook.com

  • Under App Review, click Start a Submission (newly created apps) or Add Items (previously existing apps) and select user_managed_groups

  • Save changes to the App

  • Copy App ID and App Secret from e.g. Dashboard

  • Go to the Graph API Explorer: https://developers.facebook.com/tools/explorer/

    • Select your app in the top-right dropdown

    • Click Get Token and select Get User Access Token

    • Make sure that the permission user_managed_groups is selected

    • Copy the generated access token

  • Use the following URL format for the next step: https://graph.facebook.com/oauth/access_token?client_id=APP_ID&client_secret=APP_SECRET&grant_type=fb_exchange_token&fb_exchange_token=USER_ACCESS_TOKEN

  • Replace APP_ID, APP_SECRET, USER_ACCESS_TOKEN with the values collected in previous steps

  • Access the URL. In the JSON response you will find a new access token, copy it and paste it into config/import_facebook.yml
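As an alternative to editing the URL by hand, the last steps above can be sketched in Ruby. The exchange_url helper is a hypothetical name. It only builds the URL in the format given above and does not contact Facebook.

```ruby
require 'uri'

# Builds the token-exchange URL from the steps above. The three arguments
# are the App ID, App Secret and User Access Token collected earlier.
def exchange_url(app_id, app_secret, user_access_token)
  query = URI.encode_www_form(
    client_id: app_id,
    client_secret: app_secret,
    grant_type: 'fb_exchange_token',
    fb_exchange_token: user_access_token
  )
  URI::HTTPS.build(
    host: 'graph.facebook.com',
    path: '/oauth/access_token',
    query: query
  ).to_s
end

puts exchange_url('APP_ID', 'APP_SECRET', 'USER_ACCESS_TOKEN')
```

Opening the printed URL in a browser, or fetching it with Net::HTTP and parsing the body with JSON.parse, should return the JSON response mentioned above. Using encode_www_form has the side benefit of escaping any characters in the token which are not URL-safe.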

Notes on importing large groups

The largest group imported by this script at this point had about 5,000 top-level posts, 35,000 comments and 100,000 likes. Based on this experience, here are some suggestions for importing large groups, i.e. more than a few hundred top-level posts:

  • Importing a large group will take a lot of time. Deal with this by first running in test mode while downloading data and storing them to disk. Import to Discourse separately once exporting from Facebook is complete. For large groups (10,000+ comments) the complete process can take several days.

  • If using short-term access tokens the import will have to be restarted multiple times. When the access token expires the importer will exit and tell you the index number of the last post processed. Enter this number in the config file under restart_from_topic_number to skip ahead to the same place when restarting.

  • When frequently restarting the import of a large group, it is useful to reverse the import order (see import_oldest_first under Options above). That way the index numbers used for restarting from a particular place (see above) do not shift as new posts are made in the group.

  • For very large groups (100,000+ top-level posts) it might not be possible to fetch the entire set of top-level posts with one short-lived access token, i.e. it might take more than 1-2 hours (see below). This could be solved by using a longer-lasting access token or by rewriting the importer so the initial post import is done incrementally with partial results being saved to disk.
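The two-phase approach suggested above (download to disk first, import later) relies on responses being loaded from disk when a saved copy exists. A minimal Ruby sketch of that idea follows. The fetch_with_cache method, the one-JSON-file-per-key layout and the fake_api stand-in are assumptions for illustration, not the importer's actual storage format.

```ruby
require 'json'
require 'fileutils'
require 'tmpdir'

# Returns the parsed response for `key`. Loads it from `dir` when a saved
# copy exists, otherwise calls the block and saves its result for next time.
def fetch_with_cache(dir, key)
  path = File.join(dir, "#{key}.json")
  return JSON.parse(File.read(path)) if File.exist?(path)

  data = yield
  FileUtils.mkdir_p(dir)
  File.write(path, JSON.generate(data))
  data
end

# Demo with a stand-in for the API call which counts how often it is hit.
api_calls = 0
fake_api = lambda do |post_id|
  api_calls += 1
  { 'id' => post_id, 'message' => "post #{post_id}" }
end

second_result = nil
Dir.mktmpdir do |dir|
  fetch_with_cache(dir, 'post_1') { fake_api.call('1') }
  second_result = fetch_with_cache(dir, 'post_1') { fake_api.call('1') }
end

puts "API calls made: #{api_calls}"
```

Because the second lookup is served from disk, a restarted run spends no API calls or token lifetime on data which has already been downloaded.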

Importing to categories with restricted access

Due to a limitation in Discourse, the importer can only create topics in categories which are accessible by the group everyone. This is not a problem when doing an initial one-off import, but in some cases it makes continuous importing to an active Discourse forum impossible. There is a workaround although it requires patching Discourse. What is needed is changing a single line and you can find more information here:

discourse/discourse#4641

Give feedback and share your experiences

If you run into issues using this importer, please give feedback in the official thread on the Discourse forum. Also, please share your experiences there, in particular if you import large or otherwise unusual groups and learn something which might be of interest to others:

https://meta.discourse.org/t/import-posts-from-facebook-group-into-discourse/6089/76

Suggestions for further development

  • Shared files (attachments) other than images are not imported, e.g. uploaded PDF files. It is quite possible that this can be fixed and it might even be a quick fix.

  • Some shared links are not imported, most notably shares of Facebook posts. It would probably be possible to include more types of link shares in the imports after some digging around in various types of API responses.

  • Some images are not imported but perhaps a way can be found to solve this problem. Normally posts with attached images have their type field set to "photo". However, a small number of these posts instead have type "status" and in these cases I have not been able to find a way to retrieve the images. (Note: This issue seems to be rare, in one set of 3,000+ posts I found 9 posts of this type.)

  • Polls are not imported but it might be possible since Discourse supports polls through a plugin included in the default install. If someone wants to import a group which uses polls frequently, this might be a worthy undertaking.

  • As far as I can tell importing email addresses is not really useful since it is so rare that you actually get them from the API (depending on individual privacy settings, I suppose). So in practice you will always need to deal with user email addresses manually anyway. For this reason, the importer should probably assume that real emails are never fetched.

  • It seems that there is an option called skip_notifications which can be passed to the PostCreator. However, as far as I can see it is not actually used for anything (searching the code only returns a single occurrence, in app/models/topic.rb where it is passed to PostCreator when creating moderator posts) so I have not used it. However, if it does what it seems it should do, it would be useful. Someone could look into this and add it if appropriate.

  • Activity from some users cannot be imported, probably due to privacy settings. I have not managed to figure out why, but it would be great to know so users can be informed that they need to change their settings if they want their comments etc. to be imported. If you do some testing and find the answer, please consider updating this readme file!

  • Ideally the initial fetching of top-level posts should be saved incrementally to disk, to handle initial imports of very large groups (see above in Notes on importing large groups). If you actually need this feature, i.e. if you try to import a group but run into this limitation (script exits due to expired access token before initial post fetch is complete), drop a message here and someone just might want to help you out:

    https://meta.discourse.org/t/import-posts-from-facebook-group-into-discourse/6089

  • It is pretty common that a particular network, e.g. a company or an interest group, has several Facebook groups rather than a single one. In a couple of cases I am familiar with there are over 100 satellite groups around one or more main groups. For these cases it would be useful if the script could be made to take multiple group IDs, category names etc. and process all of them in one go instead of restarting the importer after editing the config file for each group.

  • To avoid restarts due to expired access tokens, it would be nice if the importer reloaded the config file when an expired token is detected (there is already a method for this). This way, new access tokens could be entered ahead of time so the script would not need to be restarted manually, e.g. the operator could have a timer set for one hour to get a new token and update the config file.
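The last suggestion above could look roughly like the following Ruby sketch. Everything here is hypothetical: the ExpiredToken class, the with_reloaded_token method and the access_token key are illustrative names, and the importer's existing config-reloading method is not used.

```ruby
require 'yaml'
require 'tempfile'

# Hypothetical error class standing in for the API client's expiry error.
class ExpiredToken < StandardError; end

# Yields the token from the config file. When the block raises ExpiredToken,
# waits, re-reads the file and retries, up to max_reloads times.
def with_reloaded_token(config_path, max_reloads: 3, wait: 0)
  reloads = 0
  begin
    token = YAML.safe_load(File.read(config_path))['access_token']
    yield token
  rescue ExpiredToken
    reloads += 1
    raise if reloads > max_reloads

    sleep wait
    retry
  end
end

# Demo: the first token is "expired"; a new one is written to the file
# (as an operator would do ahead of time) and picked up on the retry.
result = nil
Tempfile.create('import_facebook') do |file|
  File.write(file.path, "access_token: old\n")
  result = with_reloaded_token(file.path) do |token|
    if token == 'old'
      File.write(file.path, "access_token: new\n")
      raise ExpiredToken
    end
    "imported with #{token}"
  end
end

puts result
```

In a real run the wait would be set to something like a minute, giving the operator time to paste a fresh token into the config file before the next attempt.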
