# 01 Introduction

In [3]:
# user ids and names
users = [
    {'id': 0, 'name': 'Hero'},
    {'id': 1, 'name': 'Dunn'},
    {'id': 2, 'name': 'Sue'},
    {'id': 3, 'name': 'Chi'},
    {'id': 4, 'name': 'Thor'},
    {'id': 5, 'name': 'Clive'},
    {'id': 6, 'name': 'Hicks'},
    {'id': 7, 'name': 'Devin'},
    {'id': 8, 'name': 'Kate'},
    {'id': 9, 'name': 'Klein'}
]

In [4]:
# friendship data of users
frienship_pairs = [
    (0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9),
]

Having friendships represented as a list of pairs is not the easiest
way to work with them. To find all the friendships for user 1, you have to iterate
over every pair looking for pairs containing 1. If you had a lot of pairs,
this would take a long time.
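
To make that cost concrete, here is a minimal sketch of the pair-scanning lookup. The helper name `friends_of_slow` and the variable `pairs` are hypothetical (they are not part of the original code); the point is only that every call visits every pair.

```python
# Friendship pairs copied from the cell above.
pairs = [
    (0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9),
]

def friends_of_slow(user_id, pairs):
    """Hypothetical helper: scan every pair to collect the friends of `user_id`."""
    friends = []
    for i, j in pairs:
        if i == user_id:
            friends.append(j)
        elif j == user_id:
            friends.append(i)
    return friends

print(friends_of_slow(1, pairs))  # [0, 2, 3]
```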

Instead, let's create a `dict` where the keys are user `ids` and the values
are lists of friend `ids`.

We'll still have to look at every pair to create the `dict`, but we only have to
do that once, and we'll get cheap lookups after that:

In [5]:
# initialize the dict with an empty list for each user id:
friendships = {user['id']: [] for user in users}

# and loop over the friendship pairs to populate it:
for i, j in frienship_pairs:
    friendships[i].append(j)
    friendships[j].append(i)

Now we can ask ourselves some questions, like:

"What's the average number of connections?"

First we find the total number of connections by summing up the lengths
of all the friend lists in `friendships`:

In [6]:
def number_of_friends(user):
    """How many friends does `user` have?"""
    user_id = user['id']
    friend_ids = friendships[user_id]
    return len(friend_ids)

total_connections = sum(number_of_friends(user) for user in users)

In [7]:
# then we divide by the number of users
num_users = len(users)
avg_connections = total_connections / num_users
avg_connections

2.4
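
As a sanity check on that number: every friendship pair adds one entry to two users' lists, so the total number of connections should be twice the number of pairs. The sketch below is self-contained; the variable name `pairs` is illustrative.

```python
# Friendship pairs copied from the cell above.
pairs = [
    (0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9),
]
num_users = 10

# Each pair is counted once for each of its two endpoints.
total_connections = 2 * len(pairs)

print(total_connections)              # 24
print(total_connections / num_users)  # 2.4
```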

It’s also easy to find the most connected people—they’re the people who have the largest numbers of friends.
Since there aren’t very many users, we can simply sort them from “most friends” to “least friends”:

In [8]:
# create a list of (user_id, number_of_friends) pairs
num_friends_by_id = [(user['id'], number_of_friends(user)) for user in users]
# sort by number of friends, largest first
num_friends_by_id.sort(
    key=lambda id_and_friends: id_and_friends[1],
    reverse=True
)

num_friends_by_id

[(1, 3),
 (2, 3),
 (3, 3),
 (5, 3),
 (8, 3),
 (0, 2),
 (4, 2),
 (6, 2),
 (7, 2),
 (9, 1)]

One way to think of what we’ve done is as a way of identifying people who are somehow central to the network.
In fact, what we’ve just computed is the network metric `degree centrality`: the number of connections each user has.
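
Degree centrality is often reported in normalized form, dividing each user's degree by the largest possible degree, `n - 1`. The notebook itself only computes raw counts, so the normalization below is an added illustration, and the names `degree` and `centrality` are assumptions.

```python
from collections import defaultdict

# Friendship pairs copied from the cell above.
pairs = [
    (0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9),
]
num_users = 10

# Raw degree: how many friendships each user takes part in.
degree = defaultdict(int)
for i, j in pairs:
    degree[i] += 1
    degree[j] += 1

# Normalized degree centrality: degree divided by the maximum possible degree.
centrality = {uid: degree[uid] / (num_users - 1) for uid in range(num_users)}

print(round(centrality[1], 3))  # 0.333
print(round(centrality[9], 3))  # 0.111
```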

### Design a “Data Scientists You May Know” suggester.

Your __first instinct__ is to suggest that users might know the friends of their friends. So you write some code to iterate over their friends and collect the friends’ friends:

In [9]:
def foaf_ids_bad(user):
    """foaf is short for friend of a friend"""
    return [foaf_id for friend_id in friendships[user['id']]
                        for foaf_id in friendships[friend_id]]

In [10]:
# call for id = 0, name: 'Hero'
foaf_ids_bad(users[0])

[0, 2, 3, 0, 1, 3]

It includes user 0 twice, since Hero is indeed friends with both of his friends. It includes users 1 and 2, although they are both friends with Hero already. And it includes user 3 twice, as Chi is reachable through two different friends:


In [11]:
print(friendships[0])
print(friendships[1])
print(friendships[2])

[1, 2]
[0, 2, 3]
[0, 1, 3]


Knowing that people are friends of friends in multiple ways seems like interesting information, so maybe instead we should produce a count of mutual friends. And we should probably exclude people already known to the user:

In [12]:
from collections import Counter

def friends_of_friends(user):
    user_id = user['id']
    return Counter(
        foaf_id for friend_id in friendships[user_id]
                    for foaf_id in friendships[friend_id]
                        if foaf_id != user_id
                           and foaf_id not in friendships[user_id]
    )

print(friends_of_friends(users[3]))

Counter({0: 2, 5: 1})
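
Because the result is a `Counter`, turning it into a ranked list of suggestions is a short step. The sketch below is self-contained and uses a hypothetical helper, `suggest`, that takes a user id directly and returns the top `k` candidates by mutual-friend count.

```python
from collections import Counter

# Friendship pairs copied from the cell above.
pairs = [
    (0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9),
]
friendships = {uid: [] for uid in range(10)}
for i, j in pairs:
    friendships[i].append(j)
    friendships[j].append(i)

def suggest(user_id, k=3):
    """Hypothetical helper: top-k friends of friends, ranked by mutual friends."""
    counts = Counter(
        foaf_id
        for friend_id in friendships[user_id]
        for foaf_id in friendships[friend_id]
        if foaf_id != user_id and foaf_id not in friendships[user_id]
    )
    return counts.most_common(k)

print(suggest(3))  # [(0, 2), (5, 1)]
```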


As a data scientist, you know that you also might enjoy meeting users with similar interests. (This is a good example of the “substantive expertise” aspect of data science.) After asking around, you manage to get your hands on this data, as a list of pairs `(user_id, interest)`:

In [13]:
interests = [
    (0, 'Hadoop'), (0, 'Big Data'), (0, 'HBase'), (0, 'Java'),
    (0, 'Spark'), (0, 'Storm'), (0, 'Cassandra'),
    (1, 'NoSQL'), (1, 'MongoDB'), (1, 'Cassandra'), (1, 'HBase'),
    (1, 'Postgres'), (2, 'Python'), (2, 'scikit-learn'), (2, 'scipy'),
    (2, 'numpy'), (2, 'statsmodels'), (2, 'pandas'), (3, 'R'), (3, 'Python'),
    (3, 'statistics'), (3, 'regression'), (3, 'probability'),
    (4, 'machine learning'), (4, 'regression'), (4, 'decision trees'),
    (4, 'libsvm'), (5, 'python'), (5, 'R'), (5, 'Java'), (5, 'C++'),
    (5, 'Haskell'), (5, 'programming languages'), (6, 'statistics'),
    (6, 'probability'), (6, 'mathematics'), (6, 'theory'),
    (7, 'machine learning'), (7, 'scikit-learn'), (7, 'mahout'),
    (7, 'neural networks'), (8, 'neural networks'), (8, 'deep learning'),
    (8, 'big data'), (8, 'artificial intelligence'), (9, 'Hadoop'),
    (9, 'Java'), (9, 'MapReduce'), (9, 'big data')
]

In [14]:
# it's easy to build a function that finds users with a certain interest
def data_scientists_who_like(target_interest):
    """Find the ids of all users who like the target interest."""
    return [user_id for user_id, user_interest in interests
                        if user_interest == target_interest]

This works, but it has to examine the whole list of interests for every search. If we have a lot of users and interests (or if we just want to do a lot of searches), we’re probably better off building an index from interests to users:

In [15]:
from collections import defaultdict

# keys are interests, values are lists of user_ids with that interest
user_ids_by_interest = defaultdict(list)

for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)

user_ids_by_interest

defaultdict(list,
            {'Hadoop': [0, 9],
             'Big Data': [0],
             'HBase': [0, 1],
             'Java': [0, 5, 9],
             'Spark': [0],
             'Storm': [0],
             'Cassandra': [0, 1],
             'NoSQL': [1],
             'MongoDB': [1],
             'Postgres': [1],
             'Python': [2, 3],
             'scikit-learn': [2, 7],
             'scipy': [2],
             'numpy': [2],
             'statsmodels': [2],
             'pandas': [2],
             'R': [3, 5],
             'statistics': [3, 6],
             'regression': [3, 4],
             'probability': [3, 6],
             'machine learning': [4, 7],
             'decision trees': [4],
             'libsvm': [4],
             'python': [5],
             'C++': [5],
             'Haskell': [5],
             'programming languages': [5],
             'mathematics': [6],
             'theory': [6],
             'mahout': [7],
             'neural networks': [7, 8],
             'deep learning': [8],
             'big data': [8, 9],
             'artificial intelligence': [8],
             'MapReduce': [9]})

In [16]:
# and another index, from users to interests

# keys are user_ids, values are lists of interests for that user_id
interests_by_user_id = defaultdict(list)
for user_id, interest in interests:
    interests_by_user_id[user_id].append(interest)
    
interests_by_user_id

defaultdict(list,
            {0: ['Hadoop',
              'Big Data',
              'HBase',
              'Java',
              'Spark',
              'Storm',
              'Cassandra'],
             1: ['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres'],
             2: ['Python',
              'scikit-learn',
              'scipy',
              'numpy',
              'statsmodels',
              'pandas'],
             3: ['R', 'Python', 'statistics', 'regression', 'probability'],
             4: ['machine learning', 'regression', 'decision trees', 'libsvm'],
             5: ['python',
              'R',
              'Java',
              'C++',
              'Haskell',
              'programming languages'],
             6: ['statistics', 'probability', 'mathematics', 'theory'],
             7: ['machine learning',
              'scikit-learn',
              'mahout',
              'neural networks'],
             8: ['neural networks',
              'deep learning',
              'big data',
              'artificial intelligence'],
             9: ['Hadoop', 'Java', 'MapReduce', 'big data']})

Now it's easy to find who has the most interests in common with a given user:
- Iterate over the user's interests.
- For each interest, iterate over the other users with that interest.
- Keep count of how many times we see each other user.


In [17]:
def most_common_interests_with(user):
    return Counter(
        interested_user_id for interest in interests_by_user_id[user['id']]
                               for interested_user_id in user_ids_by_interest[interest]
                                   if interested_user_id != user['id']
    )
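
The notebook does not show a call to this function, so here is a self-contained usage sketch. It also illustrates one caveat of the data: interests are matched as exact strings, so `'Big Data'` and `'big data'` count as different interests. The helper `common_interest_counts` and its `normalize` flag are hypothetical additions, not part of the original code.

```python
from collections import Counter, defaultdict

# Interest pairs copied from the cell above.
interests = [
    (0, 'Hadoop'), (0, 'Big Data'), (0, 'HBase'), (0, 'Java'),
    (0, 'Spark'), (0, 'Storm'), (0, 'Cassandra'),
    (1, 'NoSQL'), (1, 'MongoDB'), (1, 'Cassandra'), (1, 'HBase'),
    (1, 'Postgres'), (2, 'Python'), (2, 'scikit-learn'), (2, 'scipy'),
    (2, 'numpy'), (2, 'statsmodels'), (2, 'pandas'), (3, 'R'), (3, 'Python'),
    (3, 'statistics'), (3, 'regression'), (3, 'probability'),
    (4, 'machine learning'), (4, 'regression'), (4, 'decision trees'),
    (4, 'libsvm'), (5, 'python'), (5, 'R'), (5, 'Java'), (5, 'C++'),
    (5, 'Haskell'), (5, 'programming languages'), (6, 'statistics'),
    (6, 'probability'), (6, 'mathematics'), (6, 'theory'),
    (7, 'machine learning'), (7, 'scikit-learn'), (7, 'mahout'),
    (7, 'neural networks'), (8, 'neural networks'), (8, 'deep learning'),
    (8, 'big data'), (8, 'artificial intelligence'), (9, 'Hadoop'),
    (9, 'Java'), (9, 'MapReduce'), (9, 'big data')
]

def common_interest_counts(user_id, normalize=False):
    """Hypothetical helper: count shared interests, optionally ignoring case."""
    key = str.lower if normalize else (lambda s: s)
    users_by_interest = defaultdict(list)
    interests_by_user = defaultdict(list)
    for uid, interest in interests:
        users_by_interest[key(interest)].append(uid)
        interests_by_user[uid].append(key(interest))
    return Counter(
        other_id
        for interest in interests_by_user[user_id]
        for other_id in users_by_interest[interest]
        if other_id != user_id
    )

# Exact string matching, as in the cell above.
print(common_interest_counts(0).most_common())
# [(9, 2), (1, 2), (5, 1)]

# Case-insensitive matching also links user 0 to users 8 and 9 via "big data".
print(common_interest_counts(0, normalize=True).most_common())
# [(9, 3), (1, 2), (8, 1), (5, 1)]
```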

We could then use this to build a richer “Data Scientists You May Know” feature based on a combination of mutual friends and mutual interests.

### Salaries and Experience

Right as you’re about to head to lunch, the VP of Public Relations asks if you can provide some fun facts about how much data scientists earn. Salary data is of course sensitive, but he manages to provide you an anonymous dataset containing each user’s salary (in dollars) and tenure as a data scientist (in years):

In [18]:
salaries_and_tenures = [
    (83_000, 8.7), (88_000, 8.1),
    (48_000, 0.7), (76_000, 6),
    (69_000, 6.5), (76_000, 7.5),
    (60_000, 2.5), (83_000, 10),
    (48_000, 1.9), (63_000, 4.2)
]

Let's look at the average salary for each tenure:

In [20]:
# keys are years, values are lists of the salaries for each tenure
salary_by_tenure = defaultdict(list)

for salary, tenure in salaries_and_tenures:
    salary_by_tenure[tenure].append(salary)
    
# keys are years, each value is the average salary for that tenure
average_salary_by_tenure = {
    tenure: sum(salaries) / len(salaries)
    for tenure, salaries in salary_by_tenure.items()
}

average_salary_by_tenure

{8.7: 83000.0,
 8.1: 88000.0,
 0.7: 48000.0,
 6: 76000.0,
 6.5: 69000.0,
 7.5: 76000.0,
 2.5: 60000.0,
 10: 83000.0,
 1.9: 48000.0,
 4.2: 63000.0}

This turns out not to be particularly useful: none of the users
have the same tenure, which means we're just reporting the individual
users' salaries.

It might be more helpful to bucket the tenures:
    

In [21]:
def tenure_bucket(tenure):
    if tenure < 2:
        return 'less than two'
    elif tenure < 5:
        return 'between two and five'
    else:
        return 'more than five'
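
If more buckets were needed, the chain of `if`/`elif` tests could be replaced by a lookup against a sorted list of cut points. The sketch below is an alternative, not the notebook's method: `tenure_bucket_bisect`, `BOUNDS`, and `LABELS` are hypothetical names, and it relies on the standard-library `bisect.bisect_right`. It reproduces the same boundaries as the function above (a tenure of exactly 2 falls in the middle bucket, exactly 5 in the top one).

```python
from bisect import bisect_right

# Cut points between buckets; a tenure equal to a cut point goes to the upper bucket.
BOUNDS = [2, 5]
LABELS = ['less than two', 'between two and five', 'more than five']

def tenure_bucket_bisect(tenure):
    """Hypothetical alternative to tenure_bucket using sorted cut points."""
    return LABELS[bisect_right(BOUNDS, tenure)]

for t in (0.7, 2, 4.2, 5, 10):
    print(t, tenure_bucket_bisect(t))
```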

Then we can group together the salaries corresponding to each bucket:

In [22]:
# keys are tenure buckets, values are lists of salaries for that bucket
salary_by_tenure_bucket = defaultdict(list)
for salary, tenure in salaries_and_tenures:
    bucket = tenure_bucket(tenure)
    salary_by_tenure_bucket[bucket].append(salary)
salary_by_tenure_bucket

defaultdict(list,
            {'more than five': [83000, 88000, 76000, 69000, 76000, 83000],
             'less than two': [48000, 48000],
             'between two and five': [60000, 63000]})

And finally compute the average salary for each group:

In [23]:
# keys are tenure buckets, values are average salary for that bucket
# (the loop variable is named `bucket` to avoid shadowing the tenure_bucket function)
average_salary_by_bucket = {
    bucket: sum(salaries) / len(salaries)
    for bucket, salaries in salary_by_tenure_bucket.items()
}
average_salary_by_bucket

{'more than five': 79166.66666666667,
 'less than two': 48000.0,
 'between two and five': 61500.0}

And you have your soundbite: “Data scientists with more than five years’ experience earn 65% more than data scientists with little or no experience!”
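
The 65% figure comes from comparing the two bucket averages. A small self-contained check of that arithmetic, with the salary lists copied from the output above (the variable names are illustrative):

```python
# Salaries per bucket, copied from the grouped output above.
senior = [83000, 88000, 76000, 69000, 76000, 83000]  # 'more than five'
junior = [48000, 48000]                               # 'less than two'

avg_senior = sum(senior) / len(senior)
avg_junior = sum(junior) / len(junior)

# Relative difference, expressed as a percentage.
pct_more = (avg_senior / avg_junior - 1) * 100

print(round(pct_more))  # 65
```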

### Topics of interest
As you’re wrapping up your first day, the VP of Content Strategy asks you for data about what topics users are most interested in, so that she can plan out her blog calendar accordingly. 

One simple (if not particularly exciting) way to find the most popular interests is to count the words:
- lowercase each interest
- split it into words
- count the results

In [27]:
words_and_counts = Counter(
    word for user, interest in interests
             for word in interest.lower().split()
)

words_and_counts

Counter({'hadoop': 2,
         'big': 3,
         'data': 3,
         'hbase': 2,
         'java': 3,
         'spark': 1,
         'storm': 1,
         'cassandra': 2,
         'nosql': 1,
         'mongodb': 1,
         'postgres': 1,
         'python': 3,
         'scikit-learn': 2,
         'scipy': 1,
         'numpy': 1,
         'statsmodels': 1,
         'pandas': 1,
         'r': 2,
         'statistics': 2,
         'regression': 2,
         'probability': 2,
         'machine': 2,
         'learning': 3,
         'decision': 1,
         'trees': 1,
         'libsvm': 1,
         'c++': 1,
         'haskell': 1,
         'programming': 1,
         'languages': 1,
         'mathematics': 1,
         'theory': 1,
         'mahout': 1,
         'neural': 2,
         'networks': 2,
         'deep': 1,
         'artificial': 1,
         'intelligence': 1,
         'mapreduce': 1})

In [28]:
# this makes it easy to list the words
# that occur more than once
for word, count in words_and_counts.most_common():
    if count > 1:
        print(word, count)

big 3
data 3
java 3
python 3
learning 3
hadoop 2
hbase 2
cassandra 2
scikit-learn 2
r 2
statistics 2
regression 2
probability 2
machine 2
neural 2
networks 2
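
Counting individual words splits multi-word interests such as "machine learning" into separate tokens. As an alternative sketch (not part of the original notebook), we could count whole lowercased interests instead; the name `interest_counts` is an assumption.

```python
from collections import Counter

# Interest pairs copied from the earlier cell.
interests = [
    (0, 'Hadoop'), (0, 'Big Data'), (0, 'HBase'), (0, 'Java'),
    (0, 'Spark'), (0, 'Storm'), (0, 'Cassandra'),
    (1, 'NoSQL'), (1, 'MongoDB'), (1, 'Cassandra'), (1, 'HBase'),
    (1, 'Postgres'), (2, 'Python'), (2, 'scikit-learn'), (2, 'scipy'),
    (2, 'numpy'), (2, 'statsmodels'), (2, 'pandas'), (3, 'R'), (3, 'Python'),
    (3, 'statistics'), (3, 'regression'), (3, 'probability'),
    (4, 'machine learning'), (4, 'regression'), (4, 'decision trees'),
    (4, 'libsvm'), (5, 'python'), (5, 'R'), (5, 'Java'), (5, 'C++'),
    (5, 'Haskell'), (5, 'programming languages'), (6, 'statistics'),
    (6, 'probability'), (6, 'mathematics'), (6, 'theory'),
    (7, 'machine learning'), (7, 'scikit-learn'), (7, 'mahout'),
    (7, 'neural networks'), (8, 'neural networks'), (8, 'deep learning'),
    (8, 'big data'), (8, 'artificial intelligence'), (9, 'Hadoop'),
    (9, 'Java'), (9, 'MapReduce'), (9, 'big data')
]

# Count whole interests, ignoring case, rather than individual words.
interest_counts = Counter(interest.lower() for _, interest in interests)

# List the interests shared by more than two users.
for interest, count in interest_counts.most_common():
    if count > 2:
        print(interest, count)
# big data 3
# java 3
# python 3
```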
