<a href="https://colab.research.google.com/github/mattcturek/DataScienceFromScratch/blob/main/DataScienceFromScratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science From Scratch

# Chapter 1

### Introduction

## The Ascendance of Data

We live in a world that's drowning in data. Websites track every user's every click. Your smartphone is building up a record of your location and speed every second of every day. "Quantified selfers" wear pedometers-on-steroids that are always recording their heart rates, movement habits, diet, and sleep patterns. Smart cars collect driving habits, smart homes collect living habits, and smart marketers collect purchasing habits. The internet itself represents a huge graph of knowledge that contains (among other things) an enormous cross-referenced encyclopedia; domain-specific databases about movies, music, sports results, pinball machines, memes, and cocktails; and too many government statistics (some of them nearly true!) from too many governments to wrap your head around.

Buried in these data are answers to countless questions that no one's ever thought to ask. In this book, we'll learn how to find them.

## What Is Data Science?

There's a joke that says a data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician. (I didn't say it was a good joke.) In fact, some data scientists are - for all practical purposes - statisticians, while others are fairly indistinguishable from software engineers. Some are machine learning experts, while others couldn't machine-learn their way out of kindergarten. Some are PhDs with impressive publication records, while others have never read an academic paper (shame on them). In short, pretty much no matter how you define data science, you'll find practitioners for whom the definition is totally, absolutely wrong.



## Finding Key Connectors

It's your first day on the job at DataSciencester, and the VP of Networking is full of questions about your users. Until now he's had no one to ask, so he's very excited to have you aboard.

In particular, he wants you to identify who the "key connectors" are among data scientists. To this end, he gives you a dump of the entire DataSciencester network. (In real life, people don't typically hand you the data you need. Chapter 9 is devoted to getting data.)

What does this data dump look like? It consists of a list of users, each represented by a <code>dict</code> that contains that user's <code>id</code> (which is a number) and <code>name</code> (which, in one of the great cosmic coincidences, rhymes with the user's <code>id</code>):

In [None]:
users = [
    {"id": 0, "name": "Hero"},
    {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Sue"},
    {"id": 3, "name": "Chi"},
    {"id": 4, "name": "Thor"},
    {"id": 5, "name": "Clive"},
    {"id": 6, "name": "Hicks"},
    {"id": 7, "name": "Devin"},
    {"id": 8, "name": "Kate"},
    {"id": 9, "name": "Klein"}
]

friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

Having the friendships represented as a list of pairs is not the easiest way to work with them. To find all the friendships for user 1, you have to iterate over every pair looking for pairs containing 1. If you had a lot of pairs, this would take a long time.
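
To make the cost concrete, here is a minimal sketch of that scan. The helper name `friends_of_slow` is hypothetical (it is not part of the chapter's code); the point is only that every call has to touch every pair.

```python
# The friendship pairs from the data dump above.
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

def friends_of_slow(user_id):
  """Hypothetical helper: scan every pair to collect user_id's friends."""
  friend_ids = []
  for i, j in friendship_pairs:
    if i == user_id:
      friend_ids.append(j)   # user_id is the first element of the pair
    elif j == user_id:
      friend_ids.append(i)   # user_id is the second element of the pair
  return friend_ids

print(friends_of_slow(1))  # [0, 2, 3]
```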

Instead, let's create a <code>dict</code> where the keys are user <code>id</code>s and the values are lists of friend <code>id</code>s. (Looking things up in a <code>dict</code> is very fast.)

We'll still have to look at every pair to create the <code>dict</code>, but we only have to do that once, and we'll get cheap lookups after that:

In [None]:
# Initialize the dict with an empty list for each user id:
friendships = {user["id"]: [] for user in users}

# and loop over the friendship pairs to populate it:
for i, j in friendship_pairs:
  friendships[i].append(j)   # Add j as a friend of user i
  friendships[j].append(i)   # Add i as a friend of user j
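
As a quick check, the sketch below rebuilds the same kind of dict and looks up a couple of users directly. It is self-contained, so it copies the pairs from above and, as a simplifying assumption, uses `range(10)` for the ten user ids instead of the `users` list.

```python
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# One empty list per user id (the data above uses ids 0 through 9).
friendships = {user_id: [] for user_id in range(10)}

for i, j in friendship_pairs:
  friendships[i].append(j)   # j is a friend of i
  friendships[j].append(i)   # i is a friend of j

# Each lookup is now a single dict access rather than a scan over all pairs.
print(friendships[1])  # [0, 2, 3]
print(friendships[9])  # [8]
```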

Now that we have the friendships in a <code>dict</code>, we can easily ask questions of our graph, like "What's the average number of connections?"

First we find the <i>total</i> number of connections, by summing up the lengths of all the <code>friends</code> lists:

In [None]:
def number_of_friends(user):
  """How many friends does _user_ have?"""
  user_id = user["id"]
  friend_ids = friendships[user_id]
  return len(friend_ids)

total_connections = sum(number_of_friends(user) for user in users)

And then we just divide by the number of users:

In [None]:
num_users = len(users)
avg_connections = total_connections / num_users
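
One way to sanity-check this number: every friendship pair adds one entry to each of two friend lists, so the total should be twice the number of pairs. The sketch below works that arithmetic out on the pairs from above (the count of 10 users is taken from the `users` list).

```python
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Each pair (i, j) is counted once for i and once for j.
total_connections = 2 * len(friendship_pairs)
num_users = 10   # the users list above has ten entries

avg_connections = total_connections / num_users

print(total_connections)  # 24
print(avg_connections)    # 2.4
```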

It's also easy to find the most connected people - they're the people who have the largest numbers of friends.

Since there aren't very many users, we can simply sort them from "most friends" to "least friends":

In [None]:
# Create a list of (user_id, number_of_friends) pairs.
num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]

# Sort the pairs by number of friends, largest first.
num_friends_by_id.sort(key=lambda id_and_friends: id_and_friends[1], reverse=True)

One way to think of what we've done is as a way of identifying people who are somehow central to the network. In fact, what we've just computed is the network metric <i>degree centrality</i>.
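
For reference, here is a self-contained sketch of that ranking on the data above. It counts friends straight from the friendship dict rather than through `number_of_friends`, and uses `range(10)` for the user ids; both are simplifications for the example.

```python
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

friendships = {user_id: [] for user_id in range(10)}
for i, j in friendship_pairs:
  friendships[i].append(j)
  friendships[j].append(i)

# (user_id, degree) pairs, where degree is the number of friends.
num_friends_by_id = [(user_id, len(friend_ids))
                     for user_id, friend_ids in friendships.items()]

# Python's sort is stable, so users with equal degree keep their id order.
num_friends_by_id.sort(key=lambda id_and_friends: id_and_friends[1],
                       reverse=True)

print(num_friends_by_id)
# [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3), (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]
```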

## Data Scientists You May Know

While you're still filling out new-hire paperwork, the VP of Fraternization comes by your desk. She wants to encourage more connections among your members, and she asks you to design a "Data Scientists You May Know" suggester.

Your first instinct is to suggest that users might know the friends of their friends. So you write some code to iterate over their friends and collect the friends' friends:

In [None]:
def foaf_ids_bad(user):
  """foaf is short for "friend of a friend" """
  return [foaf_id
          for friend_id in friendships[user["id"]]
          for foaf_id in friendships[friend_id]]
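
To see why this version earns its `_bad` suffix, it helps to run it for Hero (id 0). The sketch below is self-contained and rebuilds the friendship dict from the pairs above, using `range(10)` for the user ids.

```python
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

friendships = {user_id: [] for user_id in range(10)}
for i, j in friendship_pairs:
  friendships[i].append(j)
  friendships[j].append(i)

def foaf_ids_bad(user):
  """foaf is short for "friend of a friend" """
  return [foaf_id
          for friend_id in friendships[user["id"]]
          for foaf_id in friendships[friend_id]]

hero = {"id": 0, "name": "Hero"}
print(foaf_ids_bad(hero))  # [0, 2, 3, 0, 1, 3]
```

The result includes user 0 twice (Hero is a friend of both of his friends), includes users 1 and 2 even though they are already Hero's friends, and includes user 3 twice because Chi is reachable through two different friends.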

Knowing that people are friends of friends in multiple ways seems like interesting information, so maybe instead we should produce a <i>count</i> of mutual friends. And we should probably exclude people already known to the user:

In [None]:
from collections import Counter

def friends_of_friends(user):
  user_id = user["id"]
  return Counter(
      foaf_id for friend_id in friendships[user_id]
      for foaf_id in friendships[friend_id]
      if foaf_id != user_id
      and foaf_id not in friendships[user_id]
  )
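
As a usage example, here is what the counting version returns for Chi (id 3). The sketch is self-contained, rebuilding the friendship dict from the pairs above with `range(10)` standing in for the user ids.

```python
from collections import Counter

friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

friendships = {user_id: [] for user_id in range(10)}
for i, j in friendship_pairs:
  friendships[i].append(j)
  friendships[j].append(i)

def friends_of_friends(user):
  user_id = user["id"]
  return Counter(
      foaf_id
      for friend_id in friendships[user_id]      # for each of my friends,
      for foaf_id in friendships[friend_id]      # find their friends
      if foaf_id != user_id                      # who aren't me
      and foaf_id not in friendships[user_id]    # and aren't already my friends
  )

chi = {"id": 3, "name": "Chi"}
print(friends_of_friends(chi))  # Counter({0: 2, 5: 1})
```

So Chi has two mutual friends with Hero (id 0) and one mutual friend with Clive (id 5).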

As a data scientist, you know that you also might enjoy meeting users with similar interests. (This is a good example of the "substantive expertise" aspect of data science.) After asking around, you manage to get your hands on this data, as a list of pairs <code>(user_id, interest)</code>:

In [None]:
interests = [
    (0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
    (0, "Spark"), (0, "Storm"), (0, "Cassandra"),
    (1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
    (1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
    (2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
    (3, "statistics"), (3, "regression"), (3, "probability"),
    (4, "machine learning"), (4, "regression"), (4, "decision trees"),
    (4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
    (5, "Haskell"), (5, "programming languages"), (6, "statistics"),
    (6, "probability"), (6, "mathematics"), (6, "theory"),
    (7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
    (7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
    (8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
    (9, "Java"), (9, "MapReduce"), (9, "Big Data")
]

For example, Hero (id 0) has no friends in common with Klein (id 9), but they share interests in Java and Big Data.

It's easy to build a function that finds users with a certain interest:

In [None]:
def data_scientists_who_like(target_interest):
  """Find the ids of all the users who like the target interest."""
  return [user_id
          for user_id, user_interest in interests
          if user_interest == target_interest]
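
A short usage sketch follows. To stay compact it assumes a small excerpt of the interests data (only the pairs that mention Python or R), stored under the hypothetical name `interests_excerpt`; for those two interests the results match what the full list gives.

```python
# Excerpt of the interests data: only the pairs mentioning Python or R.
interests_excerpt = [(2, "Python"), (3, "R"), (3, "Python"),
                     (5, "Python"), (5, "R")]

def data_scientists_who_like(target_interest):
  """Find the ids of all the users who like the target interest."""
  return [user_id
          for user_id, user_interest in interests_excerpt
          if user_interest == target_interest]

print(data_scientists_who_like("Python"))  # [2, 3, 5]
print(data_scientists_who_like("R"))       # [3, 5]
```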

This works, but it has to examine the whole list of interests for every search. If we have a lot of users and interests (or if we just want to do a lot of searches), we're probably better off building an index from interests to users:

In [None]:
# Keys are interests, values are lists of user_ids with that interest

from collections import defaultdict

user_ids_by_interest = defaultdict(list)

for user_id, interest in interests:
  user_ids_by_interest[interest].append(user_id)

And another from users to interests:

In [None]:
# Keys are user_ids, values are lists of interests for that user_id

from collections import defaultdict

interests_by_user_id = defaultdict(list)

for user_id, interest in interests:
  interests_by_user_id[user_id].append(interest)
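
Here is a small sketch of both indexes in use. It again assumes a hypothetical `interests_excerpt` containing only the Python and R pairs, so the per-user lists are shorter than they would be on the full data. It also illustrates one `defaultdict` detail worth knowing: indexing a missing key inserts an empty list, whereas `.get()` does not.

```python
from collections import defaultdict

# Excerpt of the interests data: only the pairs mentioning Python or R.
interests_excerpt = [(2, "Python"), (3, "R"), (3, "Python"),
                     (5, "Python"), (5, "R")]

user_ids_by_interest = defaultdict(list)   # interest -> list of user ids
interests_by_user_id = defaultdict(list)   # user id -> list of interests

for user_id, interest in interests_excerpt:
  user_ids_by_interest[interest].append(user_id)
  interests_by_user_id[user_id].append(interest)

print(user_ids_by_interest["Python"])  # [2, 3, 5]
print(interests_by_user_id[3])         # ['R', 'Python']

# .get() reads without creating a new key for an unknown interest.
print(user_ids_by_interest.get("COBOL", []))  # []
print("COBOL" in user_ids_by_interest)        # False
```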

Now it's easy to find who has the most interests in common with a given user:

 - Iterate over the user's interests
 - For each interest, iterate over the other users with that interest
 - Keep count of how many times we see each other user

In code:

In [None]:
def most_common_interests_with(user):
  return Counter(
      interested_user_id for interest in interests_by_user_id[user["id"]]
      for interested_user_id in user_ids_by_interest[interest]
      if interested_user_id != user["id"]
  )
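
To try this out, the sketch below runs the function for Hero (id 0). It assumes a hypothetical `interests_excerpt` restricted to the entries for users 0, 1, and 9, so only those ids can appear in the result; on the full data other users could show up as well.

```python
from collections import Counter, defaultdict

# Excerpt of the interests data: entries for users 0, 1, and 9 only.
interests_excerpt = [
    (0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
    (0, "Spark"), (0, "Storm"), (0, "Cassandra"),
    (1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
    (1, "Postgres"),
    (9, "Hadoop"), (9, "Java"), (9, "MapReduce"), (9, "Big Data"),
]

user_ids_by_interest = defaultdict(list)
interests_by_user_id = defaultdict(list)
for user_id, interest in interests_excerpt:
  user_ids_by_interest[interest].append(user_id)
  interests_by_user_id[user_id].append(interest)

def most_common_interests_with(user):
  return Counter(
      interested_user_id
      for interest in interests_by_user_id[user["id"]]
      for interested_user_id in user_ids_by_interest[interest]
      if interested_user_id != user["id"]   # don't count the user themselves
  )

hero = {"id": 0, "name": "Hero"}
print(most_common_interests_with(hero))  # Counter({9: 3, 1: 2})
```

Within this excerpt, Klein (id 9) shares three interests with Hero (Hadoop, Big Data, and Java) and Dunn (id 1) shares two (HBase and Cassandra).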

We could then use this to build a richer "Data Scientists You May Know" feature based on a combination of mutual friends and mutual interests. We'll explore these kinds of applications in Chapter 23.

## Salaries and Experience

Right as you head to lunch, the VP of Public Relations asks if you can provide some fun facts about how much data scientists earn. Salary data is of course sensitive, but he manages to provide you with an anonymous dataset containing each user's <code>salary</code> (in dollars) and <code>tenure</code> as a data scientist (in years).

In [None]:
salaries_and_tenures = [
    (83000, 8.7), (88000, 8.1),
    (48000, 0.7), (76000, 6),
    (69000, 6.5), (67000, 7.5),
    (60000, 2.5), (83000, 10),
    (48000, 1.9), (63000, 4.2)]

The natural first step is to plot the data (which we'll see how to do in Chapter 3).

It seems clear that people with more experience tend to earn more. How can you turn this into a fun fact? Your first idea is to look at the average salary for each tenure:

In [None]:
# Keys are years, values are lists of the salaries for each tenure.
salary_by_tenure = defaultdict(list)

for salary, tenure in salaries_and_tenures:
  salary_by_tenure[tenure].append(salary)

# Keys are years, each value is average salary for that tenure.
average_salary_by_tenure = {
    tenure: sum(salaries) / len(salaries)
    for tenure, salaries in salary_by_tenure.items()
}
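
Before reading much into these averages, it may be worth checking how many salaries fall under each tenure. The sketch below repeats the grouping step on the dataset above and counts the entries per key; nothing in it goes beyond the data already shown.

```python
from collections import defaultdict

salaries_and_tenures = [
    (83000, 8.7), (88000, 8.1),
    (48000, 0.7), (76000, 6),
    (69000, 6.5), (67000, 7.5),
    (60000, 2.5), (83000, 10),
    (48000, 1.9), (63000, 4.2)]

# Group salaries by exact tenure value, as in the cell above.
salary_by_tenure = defaultdict(list)
for salary, tenure in salaries_and_tenures:
  salary_by_tenure[tenure].append(salary)

# How many distinct tenures are there, and does any have more than one salary?
print(len(salary_by_tenure))                                          # 10
print(all(len(salaries) == 1 for salaries in salary_by_tenure.values()))  # True
```

With ten users and ten distinct tenure values, each "average" here is just a single user's salary.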