Gonzalo S. Pla edited this page Apr 26, 2016 · 18 revisions

Welcome to the Data Science Master Final Project page!

This work explores some features of the Twitter platform through its API, which gives programmatic access to Twitter for any application able to connect to the Twitter web services. The main tool chosen for this task is R.

R is a great tool for data manipulation, analysis and visualization. Scientists, mathematicians and statisticians use it in their day-to-day work. Private companies, government bodies and individuals use R as their main analytic tool because of its reliability, power, applications and features, and because it is completely free to use for any purpose. No license purchase is required, and the R user community offers, at no charge, a huge collection of extensions that makes R one of the top data analysis and scientific tools at present.

For those of us who grew up programming in popular languages like C, Basic, Assembler, Java, C#, Python, PHP or even JavaScript, not to mention SQL, R syntax at first sight looks unlike the others: heterogeneous, obscure, and lacking a standard way of doing things. In my opinion, the rewards of learning it greatly exceed the effort required. Yes, it's true: "no pain, no gain".

The project is about Data Science with either small or big data. My chosen option is based primarily on the statistical significance of the data and, in second place, on the manipulation of medium-sized datasets, rather than on data size. For this reason we will be dealing with a few gigabytes of data.

The Twitter API data comes in JSON format, which is supported by R (and is the base data format of MongoDB, a non-relational database suited to big data projects). While for this project the data is transformed into delimited raw text and into the R binary formats, the reader may consider writing it straight to a database like MongoDB. Such an approach would require prior UTF-8 validation through an intermediate process that the developer would have to implement.
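
The intermediate validation step could look roughly like the following. This is a minimal Python sketch, not the project's R code: the function names `is_valid_utf8` and `tweets_to_tsv` are hypothetical, and the selected fields (`id_str`, `lang`, `text`) are an assumption about which tweet attributes one would keep.

```python
import csv
import io
import json


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def tweets_to_tsv(raw_lines, fields=("id_str", "lang", "text")):
    """Flatten one-JSON-object-per-line tweets into tab-delimited text.

    Lines that are not valid UTF-8, not valid JSON, or not JSON objects
    are skipped. The field names are assumptions made for illustration.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for raw in raw_lines:
        if not is_valid_utf8(raw):
            continue
        try:
            tweet = json.loads(raw.decode("utf-8"))
        except json.JSONDecodeError:
            continue
        if not isinstance(tweet, dict):
            continue
        # Tabs and newlines inside a field would break the delimited layout.
        row = [
            str(tweet.get(f, "")).replace("\t", " ").replace("\n", " ")
            for f in fields
        ]
        writer.writerow(row)
    return out.getvalue()


sample = [
    b'{"id_str": "1", "lang": "en", "text": "Hiring R devs"}',
    b"\xff\xfe broken bytes",   # invalid UTF-8, skipped
    b"not json",                # valid UTF-8 but not JSON, skipped
]
print(tweets_to_tsv(sample))
```

The same check could sit in front of a database writer instead of a text file; only the rows that survive validation would be passed on.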

Additional R libraries will be used, such as streamR, rCharts, ggplot2, caret, tm and data.table. The specific data analysis libraries will be introduced later in the project.

Let's take a look at the essentials of this project:

Scope:

  • We want to deploy something that may be of economic or social interest; this project should therefore be designed as a means to fulfill a particular social or economic objective.

The research:

  • Several gigabytes of random tweets are extracted from Twitter and analyzed using statistical and text mining methods to understand the data and to establish the topics most popular with the people who write them. This research concludes that jobs are among the top topics for Twitter users.
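
As a rough illustration of the simplest kind of text mining involved, the sketch below counts term frequencies over a handful of made-up tweets. It is a Python toy under stated assumptions, not the analysis actually performed in the project (which relies on R libraries): the stop-word list, the tokenizing pattern and the sample tweets are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for the example.
STOPWORDS = {"the", "for", "and", "new", "are", "this"}


def top_terms(tweets, n=3):
    """Return the n most frequent terms across the given tweets."""
    counts = Counter()
    for text in tweets:
        # Keep lowercase words, hashtags and mentions; drop everything else.
        tokens = re.findall(r"[a-z#@']+", text.lower())
        counts.update(
            t for t in tokens if t not in STOPWORDS and len(t) > 2
        )
    return counts.most_common(n)


sample_tweets = [
    "Looking for a new job in London",
    "My job interview is tomorrow",
    "Coffee and code",
    "Job fair on Friday",
]
print(top_terms(sample_tweets, n=1))
```

On real data one would expect to add stemming, a fuller stop-word list and per-language handling before reading anything into the ranking.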

The question:

  • How can we deliver a product that Twitter users find useful and that could represent an investment opportunity?

The answer:

  • An online web site or API that provides a Twitter message stream, similar to a classified ads service. The user does not enter search terms. The stream is produced by a fixed query optimized to retrieve job offers. The results are then filtered by a gradient boosting machine data mining module.
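
The two-stage idea (fixed query first, gradient boosting filter second) can be sketched as follows. This is a toy pure-Python gradient boosting model with squared-error loss and one-feature decision stumps, written only to show the mechanism; it is not the project's model. The query terms, the vocabulary, the training tweets and all function names are hypothetical.

```python
# Hypothetical fixed query terms and feature vocabulary.
FIXED_QUERY = ("hiring", "job", "apply")
VOCAB = ("hiring", "job", "apply", "great", "lunch", "love")


def matches_query(text):
    """Stage 1: keep only tweets containing at least one query term."""
    words = set(text.lower().split())
    return any(term in words for term in FIXED_QUERY)


def featurize(text):
    """Binary bag-of-words features over VOCAB."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in VOCAB]


def fit_stump(X, residuals):
    """Pick the single feature whose on/off split best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        on = [r for x, r in zip(X, residuals) if x[j] == 1]
        off = [r for x, r in zip(X, residuals) if x[j] == 0]
        mean_on = sum(on) / len(on) if on else 0.0
        mean_off = sum(off) / len(off) if off else 0.0
        sse = sum((r - mean_on) ** 2 for r in on)
        sse += sum((r - mean_off) ** 2 for r in off)
        if best is None or sse < best[0]:
            best = (sse, j, mean_on, mean_off)
    return best[1:]


def fit_gbm(X, y, n_rounds=20, lr=0.5):
    """Gradient boosting for squared error: each stump fits the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        j, mean_on, mean_off = fit_stump(X, residuals)
        stumps.append((j, mean_on, mean_off))
        preds = [
            p + lr * (mean_on if x[j] else mean_off)
            for p, x in zip(preds, X)
        ]
    return base, lr, stumps


def predict(model, x):
    """Score one feature vector; higher means more job-offer-like."""
    base, lr, stumps = model
    return base + lr * sum(
        mean_on if x[j] else mean_off for j, mean_on, mean_off in stumps
    )


def filter_stream(model, tweets, threshold=0.5):
    """Stage 1 (fixed query) followed by stage 2 (boosted filter)."""
    return [
        t for t in tweets
        if matches_query(t) and predict(model, featurize(t)) > threshold
    ]


# Made-up training data: 1 = job offer, 0 = anything else.
train = [
    ("we are hiring a data analyst apply now", 1),
    ("new job opening apply today", 1),
    ("hiring engineers join us", 1),
    ("great job on the game last night", 0),
    ("i love my lunch", 0),
    ("love this song", 0),
]
model = fit_gbm([featurize(t) for t, _ in train], [label for _, label in train])

incoming = [
    "hiring data scientists apply here",
    "great job everyone",
    "love my lunch",
]
print(filter_stream(model, incoming))
```

The second incoming tweet shows why the filter stage matters: it matches the fixed query through the word "job" but is not a job offer, so the query alone would let it through.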

The visualization:

  • Layouts, menus, charts, tables and intuitive controls are included to make the product easy to use and the data easy to understand.
  • The visualization includes both the "fixed" query to get the actual stream and a free-text option to explore its features with any topic.

The source code: Everything is uploaded to GitHub and commented. It is not optimized; it is conceived as a prototype.