# CS579: Lecture 12  

**Demographic Inference I**

*[Dr. Aron Culotta](http://cs.iit.edu/~culotta)*  
*[Illinois Institute of Technology](http://iit.edu)*

# Midterm review 

~5 True/False, ~6 short answer
 
**Topics:** 
- Twitter API 
 - what comes in a tweet? 
 - how do rate limits work? 
 - can you understand API documentation? 
- graph basics: 
 - directed/undirected 
 - path 
 - cycle 
 - connected 
 - connected component 
 - degree (distribution) 
 - diameter 
 - average path length 
 - clustering coefficient 
- modeling networks 
 - random graphs 
 - regular graphs 
 - rewired graphs 
 - what makes a small world? 
- community detection 
 - girvan-newman (betweenness) 
 - graph cuts 
 - representing graphs with matrices 
 - graph laplacian 
- link prediction 
 - shortest path 
 - common neighbors 
 - jaccard 
 - preferential attachment 
 - sim rank 
 - evaluation 
- information cascades 
 - urn experiment 
 - bayes' theorem for decision making 
 - game-theoretic model 
 - maximizing payoff 
 - cluster density 
- sentiment analysis 
 - lexicon approach 
 - machine learning 
 
**Question types:** 
- What does this algorithm output? 
 - E.g., what is jaccard score for a specific link? 
 - E.g., what is the next step in girvan-newman? 
- What does this code do? 
 - E.g., I give you a new graph-generating algorithm, tell me what it produces 
- Write a new algorithm 
 - E.g., provide pseudo-code for the linear-threshold cascade model 
- True/False 
 - E.g., small world graphs have higher clustering coefficients than random graphs.

**dem·o·graph·ics**

statistical data relating to the population and particular groups within it.

E.g., age, ethnicity, gender, income, ...

# Why Demographics?

- Marketing
  - Who are my customers?
  - Who are my competitors' customers?
  - E.g., [DemographicsPro](http://www.demographicspro.com/samples#c=%40FamilyGuyonFOX)
  
- Social Media as Surveys
  - E.g., 45% of tweets express positive sentiment toward Pres. Obama
  - Who wrote those tweets?
  
- Health
  - 2% of Facebook users are expressing flu-like symptoms
  - Are they representative of the full population?



**User profiles vary from site to site.**

![rahm](rahm.png)

![rahm-fb](rahm-fb.png)

![rahm-li](rahm-li.png)

# Approaches

- Clever use of external data
  - E.g., U.S. Census name lists for gender
- Look for keywords in profile
  - "African American Male"
  - "Happy 21st birthday to me"
- Machine Learning
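
The keyword approach can be sketched with a regular expression. This is a minimal illustration, not a complete method: the pattern `AGE_PATTERN` and the helper `age_from_text` are hypothetical names, and the pattern covers only the single phrasing shown in the list above.

```python
import re

# Hypothetical pattern, written only to illustrate the keyword approach.
# It matches self-reported birthdays such as "Happy 21st birthday to me".
AGE_PATTERN = re.compile(
    r'happy (\d{1,2})(?:st|nd|rd|th) birthday to me', re.IGNORECASE)

def age_from_text(text):
    """Return the age stated in a self-reported birthday message, or None."""
    match = AGE_PATTERN.search(text or '')
    return int(match.group(1)) if match else None

print(age_from_text('Happy 21st birthday to me!'))   # 21
print(age_from_text('Happy birthday to my sister'))  # None
```

Rules like this tend to be precise but cover very few users, which is one motivation for the census-list and machine learning approaches.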

In [78]:
# Guessing gender
import configparser
import sys
import time
from TwitterAPI import TwitterAPI

def get_twitter(config_file):
    """ Read the config_file and construct an instance of TwitterAPI.
    Args:
      config_file ... A config file in ConfigParser format with Twitter credentials
    Returns:
      An instance of TwitterAPI.
    """
    config = configparser.ConfigParser()
    config.read(config_file)
    twitter = TwitterAPI(
                   config.get('twitter', 'consumer_key'),
                   config.get('twitter', 'consumer_secret'),
                   config.get('twitter', 'access_token'),
                   config.get('twitter', 'access_token_secret'))
    return twitter


def robust_request(twitter, resource, params, max_tries=5):
    """ If a Twitter request fails, sleep for 15 minutes.
    Do this at most max_tries times before quitting.
    Args:
      twitter .... A TwitterAPI object.
      resource ... A resource string to request.
      params ..... A parameter dictionary for the request.
      max_tries .. The maximum number of tries to attempt.
    Returns:
      A TwitterResponse object, or None if failed.
    """
    for i in range(max_tries):
        request = twitter.request(resource, params)
        if request.status_code == 200:
            return request
        else:
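            # Report the error on stderr, then wait out the 15-minute window.
            print('Got error: ' + request.text + '\nsleeping for 15 minutes.',
                  file=sys.stderr)
            sys.stderr.flush()
            time.sleep(15 * 60)
    return None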

twitter = get_twitter('twitter.cfg')
#request = robust_request(twitter, 'search/tweets',
#                                  {'q': 'yesterday',
#                                  'count': '100'})
tweets = []
n_tweets=1000
for r in twitter.request('statuses/filter', {'track': 'i'}):
    tweets.append(r)
    if len(tweets) % 100 == 0:
        print('%d tweets' % len(tweets))
    if len(tweets) >= n_tweets:
        break
print('fetched %d tweets' % len(tweets))

100 tweets
200 tweets
300 tweets
400 tweets
500 tweets
600 tweets
700 tweets
800 tweets
900 tweets
1000 tweets
fetched 1000 tweets


In [87]:
# not all tweets are returned
# https://dev.twitter.com/streaming/overview/messages-types#limit_notices
[t for t in tweets if 'user' not in t][:6]

[{'limit': {'timestamp_ms': '1475512436800', 'track': 67}},
 {'limit': {'timestamp_ms': '1475512436802', 'track': 80}},
 {'limit': {'timestamp_ms': '1475512436812', 'track': 73}},
 {'limit': {'timestamp_ms': '1475512436856', 'track': 97}},
 {'limit': {'timestamp_ms': '1475512437801', 'track': 173}},
 {'limit': {'timestamp_ms': '1475512437816', 'track': 149}}]

In [88]:
# restrict to actual tweets
# (remove limit notices and other non-tweet messages)
tweets = [t for t in tweets if 'user' in t]
print('fetched %d tweets' % len(tweets))

fetched 926 tweets
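
The limit notices also give a rough idea of how much data the stream dropped. The sketch below assumes, following Twitter's description of limit notices, that `track` is a running count of undelivered tweets since the connection opened; the helper `split_stream_messages` and the `sample` list are hypothetical, with the two notices copied from the output above.

```python
def split_stream_messages(messages):
    """Separate tweets from limit notices in a list of streaming messages."""
    tweets = [m for m in messages if 'user' in m]
    limits = [m['limit']['track'] for m in messages if 'limit' in m]
    return tweets, limits

# Hypothetical sample: one tweet plus two limit notices.
sample = [
    {'user': {'name': 'Nicole'}, 'text': 'hi'},
    {'limit': {'timestamp_ms': '1475512436800', 'track': 67}},
    {'limit': {'timestamp_ms': '1475512437801', 'track': 173}},
]

kept, limits = split_stream_messages(sample)
# Assumption: 'track' is cumulative, so the largest value seen is the
# best available estimate of the number of undelivered tweets.
undelivered = max(limits) if limits else 0
print('%d tweets kept, at least %d undelivered' % (len(kept), undelivered))
```

Taking the maximum rather than the last value avoids relying on notices arriving in order.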


In [90]:
# Print 10 names.
names = [t['user']['name'] for t in tweets]
names[:10]

['Ale Loves Louis◟̽◞̽',
 'Nicole ♡',
 'a r b y o n c é ✨',
 'Silvia Lozza',
 'Hope Ent',
 'Mark Tuan Trash',
 'Jenna Mears Ⓥ',
 'Goddess',
 'OurWorldChange',
 'char❤️']

In [91]:
# Fetch census name data from:
# http://www.census.gov/genealogy/www/data/1990surnames/index.html

import requests
males_url = 'http://www2.census.gov/topics/genealogy/' + \
            '1990surnames/dist.male.first'
females_url = 'http://www2.census.gov/topics/genealogy/' + \
              '1990surnames/dist.female.first'
males = requests.get(males_url).text.split('\n')
females = requests.get(females_url).text.split('\n')
print('males:\n' + '\n'.join(males[:10]))
print('\nfemales:\n' + '\n'.join(females[:10]))

males:
JAMES          3.318  3.318      1
JOHN           3.271  6.589      2
ROBERT         3.143  9.732      3
MICHAEL        2.629 12.361      4
WILLIAM        2.451 14.812      5
DAVID          2.363 17.176      6
RICHARD        1.703 18.878      7
CHARLES        1.523 20.401      8
JOSEPH         1.404 21.805      9
THOMAS         1.380 23.185     10

females:
MARY           2.629  2.629      1
PATRICIA       1.073  3.702      2
LINDA          1.035  4.736      3
BARBARA        0.980  5.716      4
ELIZABETH      0.937  6.653      5
JENNIFER       0.932  7.586      6
MARIA          0.828  8.414      7
SUSAN          0.794  9.209      8
MARGARET       0.768  9.976      9
DOROTHY        0.727 10.703     10


In [92]:
# Get names. 
male_names = set([m.split()[0].lower() for m in males if m])
female_names = set([f.split()[0].lower() for f in females if f])
print('%d male and %d female names' % (len(male_names), len(female_names)))
print('males:\n' + '\n'.join(list(male_names)[:10]))
print('\nfemales:\n' + '\n'.join(list(female_names)[:10]))

1219 male and 4275 female names
males:
jordon
ricardo
emmitt
kelly
jasper
maxwell
robt
errol
milan
santiago

females:
cinderella
kelly
jenise
dominica
mariella
adelaida
tobi
iola
aiko
dorine


In [93]:
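def gender_by_name(tweets, male_names, female_names):
    """ Label each tweet's author 'male', 'female', or 'unknown'
    by looking up the first token of the display name. """
    for t in tweets:
        # Guard against empty or whitespace-only names so that
        # every tweet gets a 'gender' key.
        parts = (t['user']['name'] or '').split()
        first = parts[0].lower() if parts else ''
        if first in male_names:
            t['gender'] = 'male'
        elif first in female_names:
            t['gender'] = 'female'
        else:
            t['gender'] = 'unknown'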

gender_by_name(tweets, male_names, female_names)

In [94]:
from collections import Counter

def print_genders(tweets):
    print('gender counts:\n', Counter([t['gender'] for t in tweets]))
    for t in tweets[:20]:
        print(t['gender'], t['user']['name'])
    
print_genders(tweets)

gender counts:
 Counter({'unknown': 643, 'female': 156, 'male': 127})
unknown Ale Loves Louis◟̽◞̽
female Nicole ♡
unknown a r b y o n c é ✨
female Silvia Lozza
female Hope Ent
male Mark Tuan Trash
female Jenna Mears Ⓥ
unknown Goddess
unknown OurWorldChange
unknown char❤️
male Kim Châu
unknown Nia🚺
male eduardo reyes
male Keith Bigglesworth
female Brooke ❁
unknown Vickywadhwani
male todd
male Timmy Turner
unknown Frenchy
female KAY


In [95]:
# What about ambiguous names?

def print_ambiguous_names(male_names, female_names):
    ambiguous = [n for n in male_names if n in female_names]
    print('found %d ambiguous names:\n'% len(ambiguous))
    print('\n'.join(ambiguous[:20]))
    
print_ambiguous_names(male_names, female_names)

found 331 ambiguous names:

kelly
drew
jude
joshua
trinidad
shannon
son
robbie
terry
lou
dee
tyler
sandy
kim
ira
troy
tristan
shawn
scott
russell


In [96]:
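# Keep names that are more frequent in one gender than the other.
# Note: male_names is filtered first, so a name with equal frequency
# in both lists ends up in female_names only.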
males_pct = dict([(m.split()[0].lower(), float(m.split()[1]))
                  for m in males if m])
females_pct = dict([(f.split()[0].lower(), float(f.split()[1]))
                    for f in females if f])

male_names = set([m for m in male_names if m not in female_names or
              males_pct[m] > females_pct[m]])
female_names = set([f for f in female_names if f not in male_names or
              females_pct[f] > males_pct[f]])

print_ambiguous_names(male_names, female_names)
print('%d male and %d female names' % (len(male_names), len(female_names)))

found 0 ambiguous names:


1146 male and 4017 female names
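
The same rule can be checked on a small case. This is a sketch on toy data: `resolve_ambiguous` is a hypothetical helper, and the percentages for `kelly` and `kim` are invented for illustration (only the `james` and `mary` values come from the census output above).

```python
def resolve_ambiguous(males_pct, females_pct):
    """Assign each name to the gender in which it is more frequent.

    Names with equal frequency in both lists are assigned to female,
    mirroring the behavior of the cell above.
    """
    male_names = {n for n, p in males_pct.items()
                  if n not in females_pct or p > females_pct[n]}
    female_names = {n for n in females_pct if n not in male_names}
    return male_names, female_names

# Toy percentages; the 'kelly' and 'kim' values are made up.
males_pct = {'james': 3.318, 'kelly': 0.02, 'kim': 0.03}
females_pct = {'mary': 2.629, 'kelly': 0.2, 'kim': 0.3}

male_names_toy, female_names_toy = resolve_ambiguous(males_pct, females_pct)
print(sorted(male_names_toy))    # ['james']
print(sorted(female_names_toy))  # ['kelly', 'kim', 'mary']
```

Note that the percentages are relative to each gender's own list, so this compares how typical a name is within each gender rather than the absolute number of people with that name.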


In [97]:
gender_by_name(tweets, male_names, female_names)
print_genders(tweets)

gender counts:
 Counter({'unknown': 643, 'female': 173, 'male': 110})
unknown Ale Loves Louis◟̽◞̽
female Nicole ♡
unknown a r b y o n c é ✨
female Silvia Lozza
female Hope Ent
male Mark Tuan Trash
female Jenna Mears Ⓥ
unknown Goddess
unknown OurWorldChange
unknown char❤️
female Kim Châu
unknown Nia🚺
male eduardo reyes
male Keith Bigglesworth
female Brooke ❁
unknown Vickywadhwani
male todd
male Timmy Turner
unknown Frenchy
female KAY


In [98]:
# Who are the unknowns?
# "Filtered" data can have big impact on analysis.

unknown_names = Counter(t['user']['name']
                        for t in tweets if t['gender'] == 'unknown')
unknown_names.most_common(20)

[('Kelsy Brown', 6),
 ('♡', 3),
 ('ㅤ', 3),
 ('Vitto', 2),
 ('C', 2),
 ('.', 2),
 ('|•|H A R R Y|•|', 2),
 ('Lucilene Pereira', 1),
 ('Alli Garris', 1),
 ('Roshell🌹', 1),
 ('~Natalie~', 1),
 ('liam', 1),
 ('BowTieKai', 1),
 ('wes¡', 1),
 ('B', 1),
 ('Yandy🍂// pinned', 1),
 ("Clay'", 1),
 ('Wambui Kihugi', 1),
 ('Matthew⚾️', 1),
 ('mer', 1)]
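
Several of these unknowns are recognizable names wrapped in punctuation or emoji (e.g., `~Natalie~`, `Matthew⚾️`). One possible mitigation, sketched here under the assumption that the first run of ASCII letters is a reasonable guess at the first name, is to normalize the display name before the lookup. `first_name_token` is a hypothetical helper, not part of the code above.

```python
import re

def first_name_token(name):
    """Return the first run of ASCII letters in a display name, lowercased."""
    match = re.search(r'[A-Za-z]+', name or '')
    return match.group(0).lower() if match else ''

for name in ['~Natalie~', 'Matthew⚾️', 'Roshell🌹', '♡', 'BowTieKai']:
    print(repr(name), '->', repr(first_name_token(name)))
```

This trades coverage for noise: it truncates accented names (`Châu` becomes `ch`) and reduces spaced-out names like `|•|H A R R Y|•|` to a single letter, so any gain in labeled users should be checked against the errors it introduces.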

In [100]:
# How do the profiles of male Twitter users differ from
# those of female users?

male_profiles = [t['user']['description'] for t in tweets
                if t['gender'] == 'male']

female_profiles = [t['user']['description'] for t in tweets
                if t['gender'] == 'female']
#male_profiles = [t['text'] for t in tweets
#                if t['gender'] == 'male']

#female_profiles = [t['text'] for t in tweets
#                if t['gender'] == 'female']

import re
def tokenize(s):
    return re.sub(r'\W+', ' ', s).lower().split() if s else []

male_words = Counter()
female_words = Counter()

for p in male_profiles:
    male_words.update(Counter(tokenize(p)))
                      
for p in female_profiles:
    female_words.update(Counter(tokenize(p)))

print('Most Common Male Terms:\n' + \
    '\n'.join(str(x) for x in male_words.most_common(10)))
    
print('\nMost Common Female Terms:\n' + \
    '\n'.join(str(x) for x in female_words.most_common(10)))

Most Common Male Terms:
('i', 25)
('the', 21)
('a', 17)
('and', 16)
('to', 13)
('of', 13)
('t', 11)
('for', 11)
('my', 10)
('you', 10)

Most Common Female Terms:
('i', 30)
('and', 22)
('the', 21)
('a', 18)
('my', 18)
('of', 14)
('be', 10)
('m', 10)
('me', 9)
('you', 9)


In [101]:
print(len(male_words))
print(len(female_words))

644
686


In [102]:
# Compute difference
diff_counts = dict([(w, female_words[w] - male_words[w])
                    for w in
                    set(female_words.keys()) | set(male_words.keys())])

sorted_diffs = sorted(diff_counts.items(), key=lambda x: x[1])

print('Top Male Terms (diff):\n' + \
    '\n'.join(str(x) for x in sorted_diffs[:10]))

print('\nTop Female Terms (diff):\n' + \
    '\n'.join(str(x) for x in sorted_diffs[-10:]))

Top Male Terms (diff):
('it', -8)
('for', -7)
('as', -6)
('sports', -4)
('games', -4)
('own', -4)
('time', -4)
('so', -4)
('t', -4)
('to', -4)

Top Female Terms (diff):
('perfect', 3)
('like', 4)
('me', 4)
('16', 5)
('i', 5)
('and', 6)
('sc', 7)
('my', 8)
('ig', 8)
('be', 9)


**A problem with difference of counts:**

<br><br><br><br>
What if we have more male than female words in total?

*Solution:* **Odds Ratio (OR)**

$$ OR(w) = \frac{p(w|\hbox{female})}{p(w|\hbox{male})} $$

$$p(w|\hbox{female}) = \frac{\hbox{freq}(w, \hbox{female})}
{\sum_i \hbox{freq}(w_i, \hbox{female})} $$
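
A small worked example may help show why the ratio is preferable to the difference. The counts below are hypothetical: the male corpus is ten times larger than the female one, but each word makes up the same share of both.

```python
from collections import Counter

# Hypothetical counts, chosen so that the male corpus is 10x larger.
male_counts = Counter({'sports': 20, 'the': 180})
female_counts = Counter({'sports': 2, 'the': 18})

def prob(word, counts):
    """p(word | class): the word's count divided by the class total."""
    return counts[word] / sum(counts.values())

for w in ['sports', 'the']:
    diff = female_counts[w] - male_counts[w]
    ratio = prob(w, female_counts) / prob(w, male_counts)
    print('%s: diff=%d, OR=%.2f' % (w, diff, ratio))
```

The differences (-18 and -162) suggest both words are strongly male, while the ratios are both 1.00, which reflects that neither word is more characteristic of one group once corpus size is accounted for.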

In [103]:
def counts_to_probs(gender_words):
    """ Compute probability of each term according to the frequency
    in a gender. """
    total = 1. * sum(gender_words.values())
    return dict([(word, count / total)
                 for word, count in gender_words.items()])

male_probs = counts_to_probs(male_words)
female_probs = counts_to_probs(female_words)
print(sorted(male_probs.items(), key=lambda x: -x[1])[:10])

[('i', 0.0255885363357216), ('the', 0.021494370522006142), ('a', 0.017400204708290685), ('and', 0.016376663254861822), ('to', 0.01330603889457523), ('of', 0.01330603889457523), ('t', 0.011258955987717503), ('for', 0.011258955987717503), ('you', 0.01023541453428864), ('my', 0.01023541453428864)]


In [104]:
def odds_ratios(male_probs, female_probs):
    return dict([(w, female_probs[w] / male_probs[w])
                 for w in set(male_probs) | set(female_probs)])

ors = odds_ratios(male_probs, female_probs)

KeyError: 'enjoy'

In [105]:
print(len(male_probs))
print(len(female_probs))

644
686


**How to deal with 0-probabilities?**

$$p(w|\hbox{male}) = \frac{\hbox{freq}(w, \hbox{male})}
{\sum_i \hbox{freq}(w_i, \hbox{male})} $$

$\hbox{freq}(w, \hbox{male}) = 0$

Do we really believe there is **0** probability of a male using this term?


**Additive Smoothing**

Reserve a small count (e.g., 1) for unseen observations.

E.g., assume we've seen each word at least once in each class.

$$p(w|\hbox{male}) = \frac{1 + \hbox{freq}(w, \hbox{male})}
{|W| + \sum_i \hbox{freq}(w_i, \hbox{male})} $$

$|W|$: number of unique words.
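
The smoothed estimate can be written as a small function. This is a sketch with hypothetical names (`smoothed_probs`, `alpha`) and toy counts; `alpha=1` corresponds to the formula above, and `vocabulary` plays the role of $W$.

```python
from collections import Counter

def smoothed_probs(counts, vocabulary, alpha=1.0):
    """Additively smoothed p(w | class) for every word in vocabulary.

    Assumes every key of counts is in vocabulary. A Counter returns 0
    for unseen words, so no special case is needed for them.
    """
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

# Toy counts: 'ig' never appears for males, 'nfl' never for females.
male_toy = Counter({'nfl': 3, 'the': 5})
female_toy = Counter({'ig': 2, 'the': 6})
vocab = set(male_toy) | set(female_toy)

male_p = smoothed_probs(male_toy, vocab)
female_p = smoothed_probs(female_toy, vocab)
for w in sorted(vocab):
    print('%s: OR=%.2f' % (w, female_p[w] / male_p[w]))
```

Unlike calling `update` on the original counters, this version leaves the raw counts untouched, so re-running it does not add the smoothing count a second time.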

In [106]:
# Additive smoothing. Add count of 1 for all words.
all_words = set(male_words) | set(female_words)
male_words.update(all_words)  
female_words.update(all_words)

male_probs = counts_to_probs(male_words)
female_probs = counts_to_probs(female_words)
print('\n'.join(str(x) for x in 
                sorted(male_probs.items(), key=lambda x: -x[1])[:10]))

('i', 0.012025901942645698)
('the', 0.010175763182238668)
('a', 0.008325624421831638)
('and', 0.00786308973172988)
('to', 0.0064754856614246065)
('of', 0.0064754856614246065)
('for', 0.005550416281221091)
('t', 0.005550416281221091)
('my', 0.005087881591119334)
('you', 0.005087881591119334)


In [107]:
ors = odds_ratios(male_probs, female_probs)

sorted_ors = sorted(ors.items(), key=lambda x: -x[1])

print('Top Female Terms (OR):\n' + \
    '\n'.join(str(x) for x in sorted_ors[:20]))

print('\nTop Male Terms (OR):\n' + \
    '\n'.join(str(x) for x in sorted_ors[-20:]))

Top Female Terms (OR):
('ig', 8.800542740841248)
('sc', 7.822704658525554)
('16', 5.867028493894165)
('be', 5.378109452736318)
('like', 4.889190411578471)
('messy', 3.911352329262777)
('nothing', 3.911352329262777)
('er', 3.911352329262777)
('perfect', 3.911352329262777)
('rainbows', 2.9335142469470825)
('home', 2.9335142469470825)
('art', 2.9335142469470825)
('walker', 2.9335142469470825)
('livet', 2.9335142469470825)
('still', 2.9335142469470825)
('where', 2.9335142469470825)
('nerd', 2.9335142469470825)
('easy', 2.9335142469470825)
('singer', 2.9335142469470825)
('18', 2.9335142469470825)

Top Male Terms (OR):
('songs', 0.32594602743856477)
('bz', 0.32594602743856477)
('covering', 0.32594602743856477)
('up', 0.32594602743856477)
('an', 0.32594602743856477)
('shit', 0.32594602743856477)
('fan', 0.32594602743856477)
('1', 0.32594602743856477)
('nfl', 0.24445952057892356)
('we', 0.24445952057892356)
('professional', 0.24445952057892356)
('defend', 0.24445952057892356)
('king', 0.244459