This repository contains all the code for "Code mixing patterns in Celebrities" / "Quantifying Sense Deviation in Twitter" (Project 16), written by the students of Group 6 as part of the Speech & Natural Language Processing course (CS60057), Autumn 2017.
- Kaustubh Hiware - 14CS30011
- G. Prithvi Raj Reddy - 14CS10016
- Kiran Sing Sastry G. - 14CS10018
- T. Karthik - 14CS10049
- Surya M. - 14CS30017
Our work was divided into three tasks:
Task 1: Tagging and Formatting
For every tweet in our dataset, we assign each word a word-level tag and a phrase (matrix)-level tag. In addition, each tweet as a whole is tagged as En / Hi / Code-switched / Code-mix-En / Code-mix-Hi / Code-mix-Equal / Other. These tags are used in the later tasks. Please refer to Task_1_Formatting for further details.
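To make the tweet-level tags concrete, here is a minimal sketch of how such a tag could be derived from word-level tags. The majority-count rule and the function name `tweet_level_tag` are assumptions made for illustration only; the rules actually used in this project are described in Task_1_Formatting. The sketch does not cover the Code-switched tag, which presumably depends on the phrase-level tags as well.

```python
from collections import Counter


def tweet_level_tag(word_tags):
    """Derive a tweet-level tag from a list of word-level tags.

    Illustrative rule only (an assumption, not the project's rule):
    count "En" and "Hi" words and compare the two counts.
    Any other word-level tag is ignored.
    """
    counts = Counter(word_tags)
    en, hi = counts["En"], counts["Hi"]

    if en == 0 and hi == 0:
        return "Other"           # no English or Hindi words at all
    if hi == 0:
        return "En"              # monolingual English
    if en == 0:
        return "Hi"              # monolingual Hindi
    if en > hi:
        return "Code-mix-En"     # mixed, English dominant
    if hi > en:
        return "Code-mix-Hi"     # mixed, Hindi dominant
    return "Code-mix-Equal"      # mixed, equal counts


if __name__ == "__main__":
    print(tweet_level_tag(["En", "En", "Hi", "Other"]))
```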
Task 2: Dataset Analysis
Following the paper "All that is English may be Hindi: Enhancing language identification through automatic ranking of likeliness of word borrowing in social media", we compute the UUR, UTR and UPR values. To check the quality of our tagging, we use the Jaccard coefficient and the Spearman coefficient as evaluation measures. Refer to Task_2_DatasetAnalysis for further details.
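The two evaluation measures can be sketched in plain Python as follows. This is a generic illustration of the Jaccard coefficient (overlap between two sets) and the Spearman rank correlation (Pearson correlation of ranks, with tied values sharing their average rank); the function names are hypothetical and the code is not taken from Task_2_DatasetAnalysis.

```python
def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation of two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    var_x = sum((p - mx) ** 2 for p in rx)
    var_y = sum((q - my) ** 2 for q in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

A typical use would be to compare the words ranked highest by one metric against a reference list with `jaccard`, and to compare two full rankings of the same words with `spearman`.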
Task 3: Sense Deviation
Following the techniques of Hamilton et al. (2016), "Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change", and of "Analyzing Semantic Changes in Japanese Loanwords", we study which senses an English word takes on when it is used in a Hindi context on social media. Further details are in Task_3_SenseDeviation.
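For orientation, the core technique of Hamilton et al. (2016) is to align two embedding spaces with an orthogonal Procrustes rotation and then measure the cosine distance between a word's two vectors. The sketch below shows that idea with NumPy. It assumes two embedding matrices whose rows follow the same shared vocabulary (for example, one trained on English-context tweets and one on Hindi-context tweets); that setup and all function names are assumptions for illustration, not the code in Task_3_SenseDeviation.

```python
import numpy as np


def procrustes_align(source, target):
    """Rotate `source` onto `target`.

    Finds the orthogonal matrix Q minimising ||source @ Q - target||_F,
    which is Q = U @ Vt for the SVD  source.T @ target = U S Vt.
    Both matrices must have one row per word, in the same word order.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return source @ (u @ vt)


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def sense_deviation(emb_a, emb_b, vocab, word):
    """Cosine distance between a word's vectors after aligning emb_a to emb_b."""
    aligned = procrustes_align(emb_a, emb_b)
    i = vocab.index(word)
    return cosine_distance(aligned[i], emb_b[i])


if __name__ == "__main__":
    # Toy check with made-up data: a pure rotation should give ~0 deviation.
    rng = np.random.default_rng(0)
    vocab = ["w0", "w1", "w2", "w3", "w4", "w5"]
    emb_a = rng.normal(size=(6, 3))
    rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    emb_b = emb_a @ rotation
    print(sense_deviation(emb_a, emb_b, vocab, "w2"))
```

A larger distance after alignment suggests that the word is used in a different sense in the two contexts; a distance near zero suggests the usage is similar.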
The results for each task are reported with that task. For a more comprehensive discussion of the results, please refer to Group6_report.pdf.
Report and Slides
NOTE: Due to space restrictions, we could not upload everything to GitHub. All the code and data can also be found on our mentor Jasabanta Patro's server, in the CelebrityCodeMixingTermProject directory.