# Welcome to Python Fundamentals for Data Science Part II

### Intro to Data Acquisition

Data acquisition (sometimes loosely called data mining) is the process of gathering data for analysis.

Some questions to consider when acquiring data are:

- What data is needed to achieve the goal?
- How much data is needed?
- Where and how can this data be found?
- What legal and privacy concerns should be considered?

### The role of data collection

Imagine for a moment that you are collecting data about books. You decided to record the title, author, and number of pages of all the books in your local library. You decided not to include language, subtitles, editors, or publishers.

If you want to publish this data to make it available to others, you would need to document how you measured your variables (e.g., were appendices included in the page count?) and the parameters for collection (i.e., your local library). This is your methodology.

### Data sources

Data can be acquired from many different sources. Broadly, they can be categorized into primary data and secondary data.

Primary data is data collected by the individual or organization who will be doing the analysis. Examples include:

- Experiments (e.g., wet lab experiments like gene sequencing)
- Observations (e.g., surveys, sensors, in situ collection)
- Simulations (e.g., theoretical models like climate models)
- Scraping or compiling (e.g., webscraping, text mining)

Secondary data is data collected by someone else and is typically published for public use. Most data you will use falls into this category. Examples include:

- Any primary data that was collected by someone else
- Institutionalized data banks (e.g., census, gene sequences)

### Cleaned vs. raw data

There is another subcategory of secondary data that can be called “pre-cleaned” data. While pre-cleaned data is undoubtedly easier to use, you lose some of the flexibility and control that working with unaltered, “raw” data offers.

### Data file formats

Data can come in a variety of different file formats, depending on the type of data. Being able to open and convert between these file types opens a whole world of data that is otherwise inaccessible. Examples of file formats include:

- Tabular (e.g., .csv, .tsv, .xlsx)
- Non-tabular (e.g., .txt, .rtf, .xml)
- Image (e.g., .png, .jpg, .tif)
- Agnostic (e.g., .dat)
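
As a rough illustration of converting between tabular formats, the sketch below reads comma-separated text and writes it back out as tab-separated text using Python's standard csv module. The sample rows are made up for the example, and in-memory strings stand in for real .csv and .tsv files.

```python
import csv
import io

# Hypothetical sample data standing in for the contents of a .csv file.
csv_text = "title,author,pages\nBook A,Author One,120\nBook B,Author Two,340\n"

# Parse the comma-separated text into a list of rows (each row is a list of strings).
rows = list(csv.reader(io.StringIO(csv_text)))

# Write the same rows back out with tabs as the delimiter, producing .tsv content.
tsv_buffer = io.StringIO()
csv.writer(tsv_buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tsv_text = tsv_buffer.getvalue()

print(tsv_text)
```

To work with real files, the io.StringIO objects could be swapped for files opened with open(), passing newline="" as the csv module documentation recommends.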

## Where to get data

### Primary data

Conducting research and experiments is typically out of scope for data scientists, but surveys and simulations are common methods for acquiring primary data. Webscraping is a special case of primary data collection in which data is extracted or copied directly from a website.

### Secondary data

Secondary data can be obtained from many different websites. Some of the most popular repositories include:
- GitHub
- Kaggle
- KDnuggets
- UCI Machine Learning Repository
- US Government’s Open Data
- FiveThirtyEight
- Amazon Web Services
- BuzzFeed
- Data is Plural
- Harvard HCI

Secondary data can sometimes be obtained via an application programming interface (API). APIs are built around the HTTP request/response cycle. A client (you) sends a request for data to a website’s server through an API call. Then, the server searches its database and responds either with the data, or an error stating that the request cannot be fulfilled.
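
The client side of this request/response cycle can be sketched as follows. The endpoint URL and the handle_response helper are hypothetical, and no request is actually sent: the sketch only builds the request and shows how a client might treat a successful response versus an error response.

```python
import json
from urllib.request import Request

# Hypothetical endpoint; building a Request object does not send anything.
request = Request(
    "https://api.example.com/books?author=Austen",
    headers={"Accept": "application/json"},
)

def handle_response(status_code, body):
    """Return the parsed data for a successful response, or raise for an error."""
    if 200 <= status_code < 300:
        return json.loads(body)
    raise RuntimeError(f"Request could not be fulfilled (HTTP {status_code})")

print(request.get_method(), request.full_url)

# Simulated response body, standing in for what a real server might return.
print(handle_response(200, '[{"title": "Book A"}]'))
```

With a real API, the request would be sent with urllib.request.urlopen(request) or a third-party client, and the status code and body would come from the server's response.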

# Binomial events

Binomial events always have two possible outcomes, which we refer to as success and failure.

The probability of a successful outcome is represented by the parameter p. For example, for a toss of a fair coin, p would be 0.5.

Sometimes you want to simulate many different scenarios. It would be very expensive to run thousands of real tests, but it is very cheap to generate thousands of simulated results.

There are many ways to simulate a binomial event. We could flip a coin a bunch of times and write down the results, or we could use the random.binomial() function from the numpy library in Python.

To use random.binomial(), we have to tell it how many trials each experiment contains (n), the probability of ‘success’ in a single trial (p), and how many experiments to run (size). Each returned value is the number of successes in one experiment.

In the first example below, there is 1 flip per experiment (n), the probability (p) of ‘success’ is 0.5 (the coin is fair), and we run the experiment 500 times (size). In the second example, each experiment consists of 100 trials with p = 0.8, so each value is a count of successes out of 100.

In [3]:
import numpy

In [4]:
print(numpy.random.binomial(n=1, p=0.5, size=500))

[0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0
 1 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 1 0 0
 0 0 0 0 0 1 1 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0
 0 0 1 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1 1 1 0 0 0
 0 0 0 1 1 1 0 1 0 0 1 1 0 1 1 1 1 1 1 0 1 1 0 1 0 1 1 0 1 0 0 0 0 0 0 0 1
 0 1 1 0 1 0 0 1 0 0 0 1 0 1 1 0 1 0 1 1 1 0 0 0 0 1 1 0 1 1 0 0 1 1 1 1 0
 0 1 1 0 1 1 0 1 1 0 0 0 1 1 1 0 0 1 1 0 1 1 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1
 0 0 0 1 1 0 0 0 0 1 1 1 1 0 0 1 0 0 1 0 1 1 1 0 0 0 1 1 0 0 1 1 1 1 0 0 1
 0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 1 1 1 1 1 0 1 1 0 0 1 0 0 1 0 0 0 1 1 0 1 1
 0 0 0 0 1 1 0 1 1 0 1 0 1 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 1 0 1 0 0
 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 0 0 0 1 1 1 1 0 1 0 0 0 0
 1 0 1 1 0 0 1 0 0 0 0 1 1 1 1 1 0 1 1 0 0 0 0 0 0 1 0 0 1 1 0 1 1 0 0 0 0
 0 1 1 0 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0 1 1 1 1 0 1 1 1 0 0 1 0 0 0
 1 0 0 1 1 0 1 1 1 0 1 1 

In [5]:
print(numpy.random.binomial(n=100, p=0.8, size=500))

[84 81 72 79 82 79 83 79 77 80 76 75 81 82 81 81 82 85 72 82 78 85 76 75
 79 76 76 79 78 80 82 86 80 83 86 82 82 77 83 76 81 74 80 75 81 81 82 75
 89 74 82 83 82 83 78 77 81 77 80 80 82 77 80 79 76 79 86 76 80 85 81 71
 87 86 82 80 81 77 83 81 82 89 81 83 81 82 80 86 72 82 75 81 81 82 74 75
 75 80 86 80 88 82 84 86 81 75 82 82 78 82 87 73 82 80 84 87 72 74 75 76
 76 83 81 85 80 79 79 74 72 80 82 77 83 82 77 78 74 80 79 74 83 81 80 80
 81 84 74 83 74 78 81 82 81 83 81 69 82 82 85 89 75 84 80 73 82 79 87 81
 80 72 83 80 80 79 89 75 78 76 83 81 84 81 79 79 82 79 78 85 83 87 81 87
 78 81 80 73 73 84 83 79 75 76 81 84 78 81 74 82 78 78 77 84 83 82 83 77
 77 81 83 81 85 74 82 80 79 73 78 77 81 88 74 83 73 76 90 71 83 81 82 73
 85 83 76 83 87 87 75 83 74 84 88 73 85 86 81 80 82 78 74 82 80 83 81 84
 79 77 77 74 84 80 84 84 89 80 77 76 73 80 83 88 74 75 84 83 80 78 79 82
 83 88 82 81 87 79 76 80 86 81 86 79 84 77 79 78 70 80 79 76 83 82 74 81
 83 86 87 74 83 85 84 82 81 84 75 84 84 83 79 78 87
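
To make the meaning of n, p, and size concrete, here is a minimal sketch of the same idea using only the standard library. The function simulate_binomial is a hypothetical helper written for this illustration, not part of numpy; it counts successes in n trials and repeats that size times. The seed makes the run repeatable.

```python
import random

def simulate_binomial(n, p, size, seed=None):
    """Return a list of `size` values, each the number of successes in n trials."""
    rng = random.Random(seed)
    results = []
    for _ in range(size):
        # One experiment: n independent trials, each succeeding with probability p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        results.append(successes)
    return results

flips = simulate_binomial(n=1, p=0.5, size=500, seed=42)
print(sum(flips) / len(flips))    # proportion of successes, expected to land near 0.5

counts = simulate_binomial(n=100, p=0.8, size=500, seed=42)
print(sum(counts) / len(counts))  # average successes per experiment, expected near 80
```

Averaging the results is a quick sanity check: the mean of many experiments should be close to n times p.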

# Introduction to Strings

A string is a sequence of characters contained within a pair of 'single quotes' or "double quotes".

In [21]:
favorite_word = "hello"
print(favorite_word)

hello


# A string can be thought of as a list of characters.
Like the items in a list, each character in a string has an index, starting at 0.

In [24]:
first_letter = favorite_word[0]
print(first_letter)

h


# String Slices

In [34]:
first_name = "Rodrigo"
last_name = "Villanueva"

# last_name[:5] would give the first five characters, "Villa"
new_account = last_name[5:]  # everything from index 5 to the end
print(new_account)

temp_password = last_name[2:6]
print(temp_password)

nueva
llan
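
A slice has the form string[start:stop], where start is included and stop is excluded; leaving either one out means "from the beginning" or "to the end". The short sketch below reuses the same last name to show each variant.

```python
last_name = "Villanueva"

first_five = last_name[:5]     # indices 0 through 4
from_five = last_name[5:]      # index 5 through the end
middle = last_name[2:6]        # indices 2 through 5

print(first_five)  # Villa
print(from_five)   # nueva
print(middle)      # llan

# Splitting at the same index and rejoining gives back the original string.
print(first_five + from_five == last_name)  # True
```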


# Concatenating Strings

In [31]:
first_name = "Julie"
last_name = "Blevins"

full_name = first_name + ' '+ last_name

print(full_name)

Julie Blevins


# Finding the Length of a String

In [33]:
len("Julie")

5

# Negative Indices

In [None]:
company_motto = "Copeland's Corporate Company helps you capably cope with the constant cacophony of daily life"

second_to_last = company_motto[-2]  # "f", the second-to-last character
final_word = company_motto[-4:]     # "life", the last four characters
print(final_word)
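
Negative indices count backward from the end of the string, so -1 is the last character. The position of the colon matters in a slice: [-5:] keeps the last five characters, while [:-5] keeps everything except the last five. A small sketch on a shorter string:

```python
word = "blueberry"

last_char = word[-1]         # counting from the end: "y"
second_to_last = word[-2]    # "r"
last_five = word[-5:]        # the final five characters
all_but_five = word[:-5]     # everything before the final five characters

print(last_char)       # y
print(second_to_last)  # r
print(last_five)       # berry
print(all_but_five)    # blue
```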

# Strings are Immutable

Strings are immutable, which means that we cannot change a string once it is created. Instead, we build a new string from pieces of the old one.

In [38]:
first_name = "Bob"
last_name = "Daily"

# first_name[0] = "R" would raise a TypeError, because strings cannot be modified in place
fixed_first_name = "R" + first_name[1:]
print(fixed_first_name)

Rob
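
To see immutability in action, the sketch below attempts the item assignment inside a try block, catches the resulting TypeError, and then builds a new string instead. The original string is left untouched.

```python
first_name = "Bob"

try:
    first_name[0] = "R"   # item assignment is not allowed on strings
except TypeError as error:
    message = str(error)
    print(message)

# Build a new string from the replacement letter and a slice of the old one.
fixed_first_name = "R" + first_name[1:]
print(fixed_first_name)  # Rob
print(first_name)        # Bob
```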


# Escape Characters
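Sometimes you will want to include characters that already have a special meaning in Python, such as a quotation mark inside a quoted string. Placing a backslash in front of the character escapes it, so Python treats it as part of the string.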

In [39]:
favorite_fruit_conversation = "He said, \"blueberries are my favorite!\""
print(favorite_fruit_conversation)

He said, "blueberries are my favorite!"
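
A few other common escape sequences, sketched below: \n for a newline, \\ for a literal backslash, and the alternative of switching quote styles so that no escape is needed. The file path is a made-up example.

```python
# Escaping the inner quotes, or using the other quote style, gives the same string.
quote = "She said, \"hi\""
same_quote = 'She said, "hi"'
print(quote == same_quote)  # True

# \n is a single newline character, so this prints on two lines.
two_lines = "first\nsecond"
print(two_lines)

# A literal backslash is written as \\ (hypothetical path for illustration).
windows_path = "C:\\data\\books.csv"
print(windows_path)
```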


# Iterating through Strings

Because strings are sequences of characters, we can iterate through a string using for or while loops.

In [41]:
def print_each_letter(word):
  for letter in word:
    print(letter)
    
print_each_letter("blue")

b
l
u
e


# Strings and Conditionals

In [42]:
favorite_fruit = "blueberry"
counter = 0
for character in favorite_fruit:
  if character == "b":
    counter = counter + 1
print(counter)

2


In [43]:
print("blue" in "blueberry")

True


In [44]:
print("e" in "blueberry" and "e" in "carrot")

False
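
The counting loop above can be wrapped in a reusable function. The sketch below does that with a hypothetical helper named count_letter and compares its result with the built-in .count() string method, which performs the same count.

```python
def count_letter(word, letter):
    """Count how many characters in word are equal to letter."""
    counter = 0
    for character in word:
        if character == letter:
            counter += 1
    return counter

print(count_letter("blueberry", "b"))  # 2
print("blueberry".count("b"))          # 2

# `in` answers the yes/no question without counting.
print("b" in "blueberry")  # True
```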


# String methods

## Formatting Methods

There are three string methods that can change the casing of a string. These are .lower(), .upper(), and .title().

- .lower() returns the string with all lowercase characters.
- .upper() returns the string with all uppercase characters.
- .title() returns the string in title case, which means the first letter of each word is capitalized.

In [47]:
favorite_song = 'SmOoTH'
print(favorite_song.lower())
print(favorite_song.upper())
print(favorite_song.title())

smooth
SMOOTH
Smooth
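
Two details worth knowing, shown in the sketch below: these methods return a new string rather than changing the original, and .title() treats an apostrophe as a word boundary, which can give surprising results for contractions. Lowercasing both sides is a simple way to compare strings without regard to case.

```python
favorite_song = "SmOoTH"
lowered = favorite_song.lower()
print(lowered)        # smooth
print(favorite_song)  # SmOoTH (the original is unchanged)

# Case-insensitive comparison by normalizing both strings first.
print("Santana".lower() == "SANTANA".lower())  # True

# .title() capitalizes the letter after an apostrophe as well.
contraction = "they're so smooth".title()
print(contraction)    # They'Re So Smooth
```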


## Splitting Strings

.split() is called on a string, takes an optional separator argument, and returns a list of the substrings found between occurrences of that separator. With no argument, it splits on whitespace.

In [48]:
man_its_a_hot_one = "Like seven inches from the midday sun"
print(man_its_a_hot_one.split())

['Like', 'seven', 'inches', 'from', 'the', 'midday', 'sun']


In [49]:
greatest_guitarist = "santana"
print(greatest_guitarist.split('n'))

['sa', 'ta', 'a']


In [50]:
smooth_chorus = \
"""And if you said, "This life ain't good enough."
I would give my world to lift you up
I could change my life to better suit your mood
Because you're so smooth"""
 
chorus_lines = smooth_chorus.split('\n')
 
print(chorus_lines)

['And if you said, "This life ain\'t good enough."', 'I would give my world to lift you up', 'I could change my life to better suit your mood', "Because you're so smooth"]
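
Calling .split() with no argument is not quite the same as calling .split(' '). The sketch below shows the difference on a string with two consecutive spaces, and also the optional second argument that limits the number of splits.

```python
spaced = "rob  thomas"   # note the two spaces

# With no argument, runs of whitespace count as one separator.
default_split = spaced.split()
print(default_split)     # ['rob', 'thomas']

# With an explicit separator, every occurrence splits, leaving an empty string.
explicit_split = spaced.split(' ')
print(explicit_split)    # ['rob', '', 'thomas']

# A second argument limits how many splits are made.
limited_split = "a,b,c".split(',', 1)
print(limited_split)     # ['a', 'b,c']
```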


## Joining Strings

In [51]:
my_munequita = ['My', 'Spanish', 'Harlem', 'Mona', 'Lisa']
print(' '.join(my_munequita))

My Spanish Harlem Mona Lisa


In [53]:
winter_trees_lines = ['All the complicated details', 'of the attiring and', 'the disattiring are completed!', 'A liquid moon', 'moves gently among', 'the long branches.', 'Thus having prepared their buds', 'against a sure winter', 'the wise trees', 'stand sleeping in the cold.']

winter_trees_full = "\n".join(winter_trees_lines)
print(winter_trees_full)

All the complicated details
of the attiring and
the disattiring are completed!
A liquid moon
moves gently among
the long branches.
Thus having prepared their buds
against a sure winter
the wise trees
stand sleeping in the cold.
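
.join() is called on the delimiter string and takes an iterable of strings, so it is effectively the reverse of .split(). Every item must already be a string; the sketch below converts numbers with str() first. The page counts are made-up values for illustration.

```python
words = ['My', 'Spanish', 'Harlem', 'Mona', 'Lisa']
sentence = ' '.join(words)

# Splitting on the same delimiter recovers the original list.
print(sentence.split(' ') == words)  # True

# Non-string items must be converted before joining (hypothetical numbers).
page_counts = [120, 340, 95]
joined_counts = ', '.join(str(count) for count in page_counts)
print(joined_counts)  # 120, 340, 95
```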


## .strip()

In [54]:
featuring = "           rob thomas                 "
print(featuring.strip())

rob thomas


In [55]:
featuring = "!!!rob thomas       !!!!!"
print(featuring.strip('!'))

rob thomas       
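
Notice that stripping '!' left the trailing spaces in place, because .strip() only removes the characters it is given. The argument is treated as a set of characters, so passing '! ' removes both. The sketch below uses a shorter string and repr() to make the remaining whitespace visible, and also shows .lstrip() and .rstrip(), which work on one side only.

```python
featuring = "!!rob thomas  !!"

only_bangs = featuring.strip('!')      # spaces before the trailing !! remain
bangs_and_spaces = featuring.strip('! ')
left_only = featuring.lstrip('!')
right_only = featuring.rstrip('!')

print(repr(only_bangs))        # 'rob thomas  '
print(repr(bangs_and_spaces))  # 'rob thomas'
print(repr(left_only))         # 'rob thomas  !!'
print(repr(right_only))        # '!!rob thomas  '
```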


## .replace()

In [56]:
with_spaces = "You got the kind of loving that can be so smooth"
with_underscores = with_spaces.replace(' ', '_')
print(with_underscores)

You_got_the_kind_of_loving_that_can_be_so_smooth
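
.replace() takes the substring to find and the substring to put in its place, and returns a new string with every occurrence replaced. An optional third argument limits how many occurrences are replaced, as in this short sketch:

```python
lyric = "so smooth, so smooth"

replaced_all = lyric.replace("so", "too")
replaced_first = lyric.replace("so", "too", 1)   # only the first occurrence

print(replaced_all)    # too smooth, too smooth
print(replaced_first)  # too smooth, so smooth
print(lyric)           # so smooth, so smooth (the original is unchanged)
```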


## .find()

.find() searches a string for a substring and returns the index of its first occurrence.

In [57]:
print('smooth'.find('t'))

4


In [58]:
print("smooth".find('oo'))

2
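
When the substring is not present, .find() returns -1 rather than raising an error, so the result should be checked before it is used as an index. The related .index() method raises a ValueError instead. A brief sketch of both behaviors:

```python
missing = "smooth".find("z")
print(missing)   # -1

# An optional second argument sets where the search starts.
first_o = "smooth".find("o")
next_o = "smooth".find("o", first_o + 1)
print(first_o, next_o)   # 2 3

# .index() behaves like .find() but raises ValueError when nothing is found.
try:
    "smooth".index("z")
    index_raised = False
except ValueError:
    index_raised = True
print(index_raised)      # True
```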


## .format()

In [59]:
def favorite_song_statement(song, artist):
  return "My favorite song is {} by {}.".format(song, artist)
print(favorite_song_statement("Smooth", "Santana"))

My favorite song is Smooth by Santana.


In [60]:
def favorite_song_statement(song, artist):
  return "My favorite song is {song} by {artist}.".format(song=song, artist=artist)
print(favorite_song_statement("Smooth", "Santana"))

My favorite song is Smooth by Santana.
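
For comparison, the same statement can be written with an f-string (available in Python 3.6 and later), which places the variable names directly inside the braces. The sketch also shows two other .format() features: positional indices and a format specification for rounding.

```python
def favorite_song_statement(song, artist):
    # An f-string evaluates the names inside the braces directly.
    return f"My favorite song is {song} by {artist}."

statement = favorite_song_statement("Smooth", "Santana")
print(statement)   # My favorite song is Smooth by Santana.

# Positional indices let arguments appear in any order.
reordered = "{1} plays {0}".format("Smooth", "Santana")
print(reordered)   # Santana plays Smooth

# A format specification after the colon controls how a value is displayed.
rounded = "{:.1f}%".format(80.456)
print(rounded)     # 80.5%
```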
