# Lecture 5

source: [565_quiz2 google doc](https://docs.google.com/document/d/1YRZ3dWxVRcuXfEFPX1JDEULiV7HXogjqG95gxf8Aooo/edit)
### Define an ordinal classification problem, and give an example
* Definition: a classification problem where the classes have a natural order
    * It can be treated as a type of regression analysis (ordinal regression) used for predicting an ordinal variable
* Example: Amazon ratings / movie reviews / any opinions that are expressed on a scale

### Give 3 possible solutions to the rating classification using standard ML models, and mention a disadvantage for each
* Can use the probability associated with a binary classifier (predict_proba)
    * But is it (the probability) really capturing intensity or just reliability?
    * Unclear as to how to use neutral/ambivalent data (far from every class?)
* Or do a 5-class classification problem
    * Not taking advantage of the ordinal structure
* Or change it into a regression problem
    * Too fine-grained, wrong objective
        * Predictions such as 1.1, 1.2, or 0.9 do not count as the class 1

### Give the main idea of SVM-based ranking

* Julian: If A is ranked higher than B, you can create a positive example for your SVM rank classifier by subtracting B's feature vector from A's, or a negative example by subtracting A's from B's. Once you have trained your binary classifier and have a weight vector W, you can get a value that can be used to rank any new document C (relative to others) by taking the dot product of W and the feature vector for C.

* Pairwise intuition
    * With rating scales, reviews with 1 star should be ranked lower than those with 2, 3, 4, and 5.
    * For all documents d_0, d_1, …, d_N in D, if rating(d_i) > rating(d_j), we want output(d_i) > output(d_j)
    * If we use a linear model, “output” is of the form w * pi(d_i), where pi() generates a feature vector for a document, so we can express our objective output(d_i) > output(d_j) as
        * w * pi(d_i) > w * pi(d_j)
        * i.e. w * (pi(d_i) - pi(d_j)) > 0
    * This is a binary classification problem whose new feature vectors are the pairwise differences of the originals.
        * If rating(d_i) > rating(d_j), label: positive
        * If rating(d_i) < rating(d_j), label: negative
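
The inequality above can be checked on toy numbers. This is only a sketch: the weight vector `w` and the feature vectors `phi_i` and `phi_j` are made-up values, not output from a trained model.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical weight vector and feature vectors, for illustration only.
w = [0.5, -1.0, 2.0]
phi_i = [1.0, 0.0, 3.0]  # features of the higher-rated document d_i
phi_j = [2.0, 1.0, 1.0]  # features of the lower-rated document d_j

score_i = dot(w, phi_i)
score_j = dot(w, phi_j)

# Pairwise feature vector: the difference of the two originals.
diff = [a - b for a, b in zip(phi_i, phi_j)]
pair_score = dot(w, diff)

# w * (pi(d_i) - pi(d_j)) equals w * pi(d_i) - w * pi(d_j),
# so its sign tells us which document is ranked higher.
assert abs(pair_score - (score_i - score_j)) < 1e-9
label = 1 if pair_score > 0 else 0
```

Because the model is linear, classifying the difference vector and comparing the two individual scores give the same answer.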


### SVM ranking

* An SVM with a linear kernel is a linear model and well suited to this kind of ranking problem
* It can handle a huge set of training examples and a huge feature space
* We don’t need the output of `predict` (which gives positive or negative); we just calculate dot products directly.
    * `X_test.dot(svc.coef_[0])` (feature vectors dotted with the SVC coefficients; `coef_` is only available with a linear kernel)
* This is useful for information retrieval as well

### Know how to convert feature vectors and ratings to prepare for SVM Rank

* For each datapoint, randomly select one other datapoint that has a different rating.
* Create a new feature vector which is the difference between the feature vectors of the two datapoints
* Create a label which is 1 if the rating of the first one is larger, 0 if the second is larger. 

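```
import random

from scipy.sparse import vstack  # assumes sparse feature matrices; use numpy.vstack for dense arrays


def convert_to_pairwise(data, ratings):
    """Build pairwise-difference features and binary labels for SVM ranking."""
    pairwise_data = []
    pairwise_class = []

    for i in range(data.shape[0]):
        j = random.randint(0, data.shape[0] - 1)

        # if the ratings are the same, pick again
        # (assumes at least two distinct ratings exist, otherwise this loops forever)
        while ratings[i] == ratings[j]:
            j = random.randint(0, data.shape[0] - 1)

        # difference between the two datapoints' feature vectors
        diff_data = data[i] - data[j]
        pairwise_data.append(diff_data)

        # label: 1 if the first datapoint has the higher rating, else 0
        if ratings[i] > ratings[j]:
            pairwise_class.append(1)
        else:
            pairwise_class.append(0)

    output_pairwise_data = vstack(pairwise_data)
    return output_pairwise_data, pairwise_class
```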

### Know how to calculate the score for a text after the SVM ranking model has been built
* Train an SVC model on the pairwise data and pairwise classes.
* To score a text, take the dot product of its feature vector with the coefficients of that SVC model
* To evaluate, calculate Kendall’s Tau between the predicted ranking and the gold standard

### Given a pair of rankings (one gold standard, one predicted), calculate Kendall's Tau
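* Kendall's Tau = (# of concordant pairs - # of discordant pairs) / total # of pairs
* A concordant pair is a pair whose relative order is the same in both rankings; a discordant pair is one whose order is flipped
* Example:
    * X1 = [1, 2, 3, 4, 5, 6, 7]
    * X2 = [1, 3, 2, 4, 5, 7, 6]
    * Concordant = 19
    * Discordant = 2 (the pairs (2, 3) and (6, 7))
    * Total = 21
    * Tau = (19 - 2) / 21 ≈ 0.81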
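
The calculation can be sketched in plain Python. The function name `kendall_tau` is our own, and the sketch assumes two equal-length rankings with no ties (`scipy.stats.kendalltau` is a library alternative that also handles ties).

```python
from itertools import combinations


def kendall_tau(gold, predicted):
    """Kendall's Tau for two equal-length rankings without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(gold)), 2):
        # Positive product: the pair is ordered the same way in both rankings.
        direction = (gold[i] - gold[j]) * (predicted[i] - predicted[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total = len(gold) * (len(gold) - 1) // 2
    return (concordant - discordant) / total


x1 = [1, 2, 3, 4, 5, 6, 7]
x2 = [1, 3, 2, 4, 5, 7, 6]
tau = kendall_tau(x1, x2)
print(round(tau, 3))  # 17 / 21, roughly 0.81
```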

### Give one major problem with training a fake review classifier
* “Fake fakes” (fake reviews written to order for building a dataset) are very different from real fakes, so a classifier trained on them may not be accurate on real ones
* Very difficult to get “the gold standard”
* Difficult to identify fake reviews too
* Generally metadata is much more promising than text for identifying fakes

### Define argumentation mining relative to other CL topics such as sentiment analysis and discourse

* The identification and segmentation of argumentative units
* The identification and classification of supporting and objecting units
* The identification and classification of argumentative structure
    * We could make use of argumentation mining to analyze the sentiment, polarity, etc. of a text.

### Know the major indicators of sarcasm, and be able to identify instances using those indicators
* Exaggerations and hyperbole
* Use of interjections
* Incongruent punctuation
* Striking polarity conflict (hashtags / emojis that indicate the opposite sentiment of the content of the tweet)
* Obvious role-playing
* Fixed expressions (yeah, not so much)
* Explicit mention of sarcasm (#sarcasm)

### Know what information beyond an individual tweet/post is important for identifying sarcasm
* User profiling 
* Identify user stance toward relevant targets
* Identify user personality as determined by entire tweet history
* Identify user mood as determined by immediate tweet history

# Lecture 6

### Provide a basic schema for emotion classification (i.e. list 5/6/7 major emotions)
Surprise, happiness, anger, fear, disgust, and sadness, etc

### Define distant supervision
* Finding signals that give you labels which may not be perfectly accurate, but are accurate enough to train on
* The training set for the classification problem is labelled automatically from such signals instead of by hand
* Use some information (such as hashtags, emojis) in a tweet as a “label”, not a “feature”

### Explain how an emotion classification dataset can be derived using distant supervision
* Use features of social media posts as labels, such as:
    * Hashtags, e.g. #Depressed
    * Emoji
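
One way this labelling could look in code is sketched below. The hashtag-to-emotion mapping, the function name `distant_label`, and the example tweets are all hypothetical; a real dataset would need a much larger mapping and some noise filtering.

```python
import re

# Hypothetical mapping from hashtags to emotion labels.
HASHTAG_TO_EMOTION = {
    "#depressed": "sadness",
    "#happy": "happiness",
    "#scared": "fear",
}


def distant_label(tweet):
    """Return (text, emotion) if exactly one emotion is signalled, else None."""
    tags = [t.lower() for t in re.findall(r"#\w+", tweet)]
    emotions = {HASHTAG_TO_EMOTION[t] for t in tags if t in HASHTAG_TO_EMOTION}
    if len(emotions) != 1:
        return None  # no label, or conflicting labels

    # Remove the labelling hashtags so they act as the label, not as a feature.
    def drop_label_tag(match):
        return "" if match.group().lower() in HASHTAG_TO_EMOTION else match.group()

    text = " ".join(re.sub(r"#\w+", drop_label_tag, tweet).split())
    return text, emotions.pop()


tweets = [
    "Failed my exam again #Depressed",
    "Sunny day at the beach #happy #vacation",
    "Just a normal day",
]
dataset = [pair for pair in map(distant_label, tweets) if pair is not None]
```

Tweets without a usable hashtag are simply left out of the training set.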

### Know what is meant by the "big five" personality traits 
* “Big five” → five major human personality traits
* Extroversion (vs. Introversion)
* Emotional stability (vs. Neuroticism)
* Agreeableness (vs. Disagreeableness)
* Conscientiousness (vs. Unconscientiousness)
* Openness (vs. Conservatism)  
                 
### Distinguish between an open vocabulary and a closed vocabulary approach → Schwartz

* "Closed vocabulary" using LIWC (content analysis)
    * A closed vocabulary approach uses a lexicon which has predefined categories for each word.
    * The latest version of LIWC has 64 categories such as articles, prepositions, … family, affect, body, etc.
    * It yields findings like:
        * Neurotic and agreeable people tend to use more first-person singulars.
        * People low in openness talk more about social processes.
        * Females use more first-person singular pronouns, males use more articles.
    * Most studies linking language with psychological variables rely on a priori fixed sets of words, such as the LIWC categories carefully constructed over 20 years of human research.
    * When one has a specific theory in mind or a small sample size, this can be ideal.


* Open vocabulary
    * The words and clusters of words analyzed are determined by the data itself
        * Enables us to find correlations such as: the word ‘hug’ is positively correlated with “agreeableness” (which is not categorized in LIWC)
        * Extract words, phrases, and topics (automatically clustered sets of words) from millions of data points, and find the language that correlates most with gender, age, and the five factors of personality.
    * User-normalized frequency of words and phrases identified using PMI
    * LDA topic models
        * Build a topic model for the entire corpus
        * Calculate the probability of a topic for a given subject, p(t|s)
    * Does not rely on predefined categories for each word; instead, it derives characteristics of word usage using PMI and LDA.
    * Open vocabulary analysis yields further insights into the behavioral residue of personality types beyond those from a priori word-category based approaches, giving unanticipated results (correlations between unexpected categories of linguistic features, e.g. Japanese cartoons or sports teams, and personality, gender, or age).

* Models created with open vocabulary features outperformed those created based on LIWC features.
* A model which includes LIWC features on top of the open-vocabulary features does not show any improvement, suggesting that the open-vocabulary features capture predictive information which fully supersedes LIWC

### Know the common computational linguistics techniques used for linguistic feature extraction in the [Schwartz et al study](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0073791) [draw connections from paper to what we learned in class] [PMI and Topic modelling] 
* Emoticon-aware word tokenization 
* n-grams
* Lexical approaches
* Words and phrases extractions
    * N-grams
    * Extract phrases which have high informative value according to PMI (pointwise mutual information), which is based on the ratio of the joint probability of the phrase to the probability of its words occurring independently. (Keep words and phrases which are commonly used.)
* Topic extractions
    * LDA: topics are a distribution of words
    * Produced 2000 topics in the paper
* Word Cloud visualization
    * Summarize results 
    * Represent distinguishing topics 
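
The PMI idea from the list above can be sketched for bigrams as follows. This is a minimal illustration on a made-up token list; the function name `bigram_pmi` is our own, and the exact PMI variant and threshold used in the paper are not reproduced here.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """PMI of each adjacent word pair: log2( p(w1, w2) / (p(w1) * p(w2)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for (w1, w2), count in bigrams.items():
        p_joint = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        pmi[(w1, w2)] = math.log2(p_joint / p_independent)
    return pmi


# Made-up toy corpus, for illustration only.
tokens = "happy birthday to you happy birthday to me you and me are happy".split()
pmi = bigram_pmi(tokens)

# Pairs that co-occur more often than chance would predict get a higher PMI.
ranked = sorted(pmi, key=pmi.get, reverse=True)
```

A phrase would then be kept as a feature only if its PMI is above some chosen threshold.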

### Identify which are the features (independent variables) and which are the output/response/dependent variables for the correlation analysis in Schwartz et al, and contrast this with a "typical" machine learning technique in author profiling

* Personality/age/gender are used as the input (features) to predict the use of words, instead of the other way around (typical author profiling uses words to predict personality, age, gender)
    * Including multiple variables in a single model helps eliminate spurious correlation
    * “Our correlational analysis produces a comprehensive list of the most distinguishing language features for any given attribute, words, phrases, or topics which maximally discriminate a given target variable.”

* Word/phrase/topic probabilities are the output/response/dependent variables
    * A different linear regression model for each word/phrase/topic
    * Correct for multiple tests (Bonferroni corrected significance levels)

* For a given word/phrase/topic, the coefficient associated with personality/age/gender for its model is its "correlation"
* Correlation analysis was conducted for distinguishing individual effects of each feature such as age, gender, personality. 
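
Since one regression is run per word/phrase/topic, many significance tests are run at once. The Bonferroni step could be sketched as below. The p-values are invented for illustration and the function name `bonferroni_significant` is our own, not something from the paper.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the features whose p-value is below the corrected threshold alpha / m."""
    m = len(p_values)  # number of tests, one per word/phrase/topic
    threshold = alpha / m
    return {name for name, p in p_values.items() if p < threshold}


# Hypothetical p-values from five per-word regressions.
p_values = {"hug": 0.0001, "party": 0.004, "the": 0.03, "anime": 0.012, "beach": 0.2}

# With 5 tests the threshold is 0.05 / 5 = 0.01.
significant = bonferroni_significant(p_values)
```

With thousands of features the threshold becomes very strict, which is why a large sample is needed for any feature to survive.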



### With reference to the results of Schwartz et al, discuss how "traditional" sentiment analysis (i.e. positive and negative) relates to variation due to personality, ages, and gender discovered in that study.

* Used a very large sample and kept only the statistically significant features to drive results (traditional studies had limited sample sizes, which led to false discoveries such as “males use more emoticons”)
* Ages 
    * Youngest: slang, emoticons, internet speak, school
    * 23-29: some internet speak, work starts appearing, beer, etc
    * Use of ‘I’ decreases after age 22, replaced by ‘we’
    * Limitation: rarity of older individuals

* Gender
    * Females use more psychological and social processes, emotion words, first-person singulars.
    * Males use more possessive words (my wife, my girlfriend)

* Personality
    * Based on a big five questionnaire
    * Extroverts: party, love you, boys, ladies vs. introverts: computer, internet, reading, anime, manga, japanese, etc
    * Openness: music, art, writing, dream, universe, soul
    * Some of the extracted features were ones that LIWC did not capture
    * Neuroticism: sick of (negative words?)
    * Language related to emotional stability (low neuroticism)
        * Emotional stability → sports, vacation, beach, church, team, family time (active words, positive words related to emotional stability?)



# Lecture 7

### Know what information a geoJSON file contains (how is the map information stored?).

* Name, id, created_at, updated_at, geometry (a MultiPolygon object which is bounded by a bunch of points)
* MultiPolygon methods: centroid (returns a point object), contains (returns a boolean)
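
A sketch of how the map information is laid out, using only the standard `json` module. The region "Exampleland" and its coordinates are invented, and putting `name` and `id` under `properties` is an assumption about the file; the course file may store them differently.

```python
import json

# Hypothetical one-region geoJSON document.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"name": "Exampleland", "id": 1},
      "geometry": {
        "type": "MultiPolygon",
        "coordinates": [
          [[[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0], [0.0, 0.0]]]
        ]
      }
    }
  ]
}
"""

data = json.loads(geojson_text)
feature = data["features"][0]
name = feature["properties"]["name"]
geometry = feature["geometry"]

# A MultiPolygon is a list of polygons; each polygon is a list of rings;
# each ring is a list of [longitude, latitude] points that ends where it starts.
polygons = geometry["coordinates"]
outer_ring = polygons[0][0]
print(name, geometry["type"], len(polygons), len(outer_ring))
```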

### Identify situations where we might want to manipulate marker size when plotting points on a map
* Marker size to reflect the data you have (e.g. population, crime rate etc)
* Individual vs region

### Define a choropleth and distinguish it from simple plotting of points on maps

```
import plotly.express as px

# Prov_happy_df (a DataFrame) and ca_geojson (a loaded geoJSON dict) are defined earlier
fig = px.choropleth(
    Prov_happy_df,  # identifiers in the geoJSON and the df must be aligned ("name")
    geojson=ca_geojson,
    color="happiness",  # column whose value will be used for colors
    locations="name",  # location names
    projection="mercator",  # a projection of the map to improve the look
    color_continuous_scale="sunset",  # color scheme
    hover_data=["happiness", "name"],  # a list of columns which should appear when hovering over
)
fig.update_geos(fitbounds="locations")  # centre the plot on the relevant parts of the world
fig.show()
```

* Points: visualize individual points and their properties
* Choropleth: visualize regions and their properties
    * A map which is divided up into administrative regions with each region given a color based on some variable.

### Contrast the three major classes in the datetime module

* Datetime: date + time (indicates an exact moment in time)
    * `datetime.datetime(2020,3,16,6,30,1)`
    * Methods: ctime()

* Date: only the date
    * `datetime.date(2020,3,16)`
    * Methods: replace(year=,month=,day=)

* Time: only the time
    * `datetime.time(6,30,1)`
    * Methods: replace(hour=,minute=,second=)
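
A short sketch contrasting the three classes, using only the standard library (the variable names are our own):

```python
import datetime

moment = datetime.datetime(2020, 3, 16, 6, 30, 1)  # date + time
day = moment.date()    # only the date part
clock = moment.time()  # only the time part

# replace() returns a new object with some fields changed.
next_year = day.replace(year=2021)
on_the_hour = clock.replace(minute=0, second=0)

# A date and a time can be combined back into a datetime.
rebuilt = datetime.datetime.combine(day, clock)

print(moment.ctime())  # Mon Mar 16 06:30:01 2020
```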

### Explain how timezones work in the datetime module
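* A datetime parsed from a string is naive (it has no timezone); attach a timezone with `localize`, then switch to another timezone with `astimezone`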

```
import datetime

from pytz import timezone

datetime_string1 = "9:30 PM, Mar 16, 2021"
format_string1 = "%I:%M %p, %b %d, %Y"
# strptime returns a naive datetime (no timezone attached)
dt = datetime.datetime.strptime(datetime_string1, format_string1)

pacific_time = timezone("Canada/Pacific")
localized_dt = pacific_time.localize(dt)  # attach a timezone: naive -> timezone-aware (Pacific)

localized_dt.astimezone(timezone("Canada/Eastern"))  # the same moment expressed in Eastern time
```

# Lecture 8

### Define code-switching and explain why code-switching is a challenge for NLP
* Code-switching (CS) is switching between languages in the same sentence
* It’s challenging for NLP because:
    * Words are OOV if they are not in our monolingual corpora
        * Even if a word is in the corpora, it may be used in a different context, which makes it difficult to build good embeddings
    * Hard to identify syntactic structure
    * Encoding problems with different characters 
* In short: two languages in one sentence, so monolingual resources such as word2vec embeddings do not transfer directly, OOV problems, etc.

### Know that emoji have a corresponding text string, and how it can be used for sentiment analysis
* 🤐 :zipper-mouth_face:
* Converting the emoji to its text string lets it be encoded like any other token (e.g. in a BOW representation), so a sentiment classifier can use it as a feature.
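
A rough standard-library sketch of the conversion. It uses Unicode character names as a stand-in for a dedicated package (the `emoji` package's `demojize` does this properly), and treating every character in the "Symbol, other" category as an emoji is a simplifying assumption.

```python
import unicodedata


def emoji_to_token(text):
    """Replace each 'Symbol, other' character with a :name: token."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "unknown").lower().replace(" ", "_")
            pieces.append(f" :{name}: ")
        else:
            pieces.append(ch)
    # Collapse the extra spaces introduced around the tokens.
    return " ".join("".join(pieces).split())


# \U0001F910 is the zipper-mouth face emoji.
print(emoji_to_token("not saying a word \U0001F910"))
# not saying a word :zipper-mouth_face:
```

After this step the emoji is an ordinary token that a tokenizer and a BOW vectorizer can count.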

### Compare and contrast emoji and emoticons in terms of their computational representation and distribution in corpora.
* Both usually:
    * Appear at the end of a sentence (substituting for punctuation)
    * Can be repeated multiple times for intensification
    * Are usually redundant, or stand in for nouns
    * Express emotions
* Differences:
    * Emoticons are created from standard ASCII punctuation instead of having their own Unicode characters
    * Emoticons are less varied, and don’t have as many nouns

### Identify features of text on twitter or other social media platforms that make it distinct from regular published text (e.g. news).
* Emoji, Emoticons, Hashtags, Addressing (using @),
* Acronym = lol, idk
* Textspeak (non-acronym): shorter words based on orthographic variants that exploit the phonetic properties of a word (creative spelling)
    * EG: Gr8, 2day, b4
* Misspellings (non-textspeak)
    * Elongations 
        * EG: Soooooo 
    * Clippings/shortenings
        * EG: Tho, kno, cuz
    * Conjoined: words squished together without a space
        * E.g. #thisiswhativewaitedfor

### Discuss how these sort of features can result in errors in NLP systems if special steps are not taken.
* For misspellings such as elongations, if the word is not converted back to a more standard form, the parser can miss important information. 
    * EG: In “soooooo long”, “soooooo” would be OOV since it isn’t in the vocab. 

### Provide possible solutions for problems stemming from the special features of social media.
* Packages like ekphrasis can annotate special features, like “elongated”, “hashtag”, “repeated”, to help encode this additional information
* Bigrams / trigrams could help give more context around special features
* Spelling correctors, but they would have to be very advanced and cover a wide range of misspellings. 
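
The annotate-and-normalize idea can be sketched with a regular expression. This is not how ekphrasis works internally; the tiny `VOCAB`, the `<elongated>` tag, and the function names are assumptions made for illustration.

```python
import re

ELONGATION = re.compile(r"(\w)\1{2,}")  # the same character 3 or more times in a row
VOCAB = {"so", "long", "cool"}  # hypothetical tiny vocabulary


def normalize_elongation(token):
    """Shrink elongated runs to a known word and tag the token as elongated."""
    if not ELONGATION.search(token):
        return [token]
    # Try keeping two repeated characters first ("cool"), then one ("so").
    for keep in (2, 1):
        candidate = ELONGATION.sub(lambda m: m.group(1) * keep, token)
        if candidate.lower() in VOCAB:
            return [candidate, "<elongated>"]
    return [token, "<elongated>"]  # unknown word: keep it, but still tag it


def annotate(text):
    out = []
    for token in text.split():
        out.extend(normalize_elongation(token))
    return out


print(annotate("sooooo long"))  # ['so', '<elongated>', 'long']
print(annotate("Coooooooool"))  # ['Cool', '<elongated>']
```

The word becomes in-vocabulary again, while the extra tag preserves the fact that the writer intensified it.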

### Understand why, in the context of sentiment analysis or author profiling, it might be a bad idea to simply remove the features that cause trouble.

* Removing these important features would mean removing certain amplifications.
    * EG: In “sooooo long”, if we remove “sooooo” then we lose the amplification, which can change the sentiment from very negative to negative
    * Cool vs Coooooooool (they should be different)