Data is stored in data structures that are built into the Python language: numbers, strings, lists, dictionaries, etc.

# Dictionaries

In [2]:
info_table = [
    {"color": "green", "number": 7},
    {"color": "red", "number": 2},
    {"color": "orange", "number": 1}
]

type(info_table) 


list

In [3]:
for row in info_table:
    print(row["color"])

green
red
orange


In [4]:
import pandas as pd
df = pd.DataFrame(info_table)
type(df)

pandas.core.frame.DataFrame

In [5]:
print(df["color"]) 

0     green
1       red
2    orange
Name: color, dtype: object


# Python Modules
The main modules we will focus on are csv and json.
The first major file type we will explore is CSV (comma-separated values).
The second major file type we will explore is JSON (JavaScript Object Notation).
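
A minimal sketch of how the two modules read data, assuming small in-memory samples instead of real files (the sample text and variable names are illustrative):

```python
import csv
import io
import json

# Hypothetical CSV text; in practice this would come from open("some_file.csv")
csv_text = "color,number\ngreen,7\nred,2\norange,1\n"

# csv.DictReader yields one dictionary per row, keyed by the header line
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["color"])      # every CSV value is read as a string

# Hypothetical JSON text holding similar records
json_text = '[{"color": "green", "number": 7}, {"color": "red", "number": 2}]'

# json.loads turns JSON text into Python lists and dictionaries
records = json.loads(json_text)
print(records[1]["number"])  # JSON numbers keep a numeric type
```

Note the difference in types: the CSV reader returns the number column as text, while JSON preserves it as an integer.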

# NumPy
In Python, the most fundamental package used for scientific computation is NumPy (Numerical Python). It provides lots of useful functionality for mathematical operations on vectors and matrices in Python. Matrix computation is the primary strength of NumPy.
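
As a small illustration of the matrix computation mentioned above, here is a sketch with made-up values (the names `A` and `v` are arbitrary):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])   # a 2x2 matrix
v = np.array([10, 20])           # a vector of length 2

elementwise = A * 2    # multiplies every entry by 2
product = A @ v        # matrix-vector product
transposed = A.T       # rows and columns swapped

print(product)
```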

# SciPy
In the Data Science domain, Python’s SciPy stack (a collection of software specifically designed for scientific computing) is used heavily for conducting scientific experiments. SciPy is a library of software for engineering and science applications and contains functions for linear algebra, optimization, integration, and statistics.

# Statsmodels
Statsmodels is a library for Python that enables its users to conduct data exploration via the use of various methods of estimation of statistical models and performing statistical assertions and analysis. The library provides insights when diagnosing issues with linear regression models, generalized linear models, discrete choice models, robust linear models, and time series analysis models with various estimators.

# Pandas
Pandas is a Python package designed to work with “relational” data, and it helps replicate the functionality of relational databases in a simple and intuitive way. It is designed for quick and easy data cleansing, manipulation, aggregation, and visualization.
There are two main data structures in the library:

“Series” - one-dimensional
“DataFrames” - two-dimensional
 
Here are a few ways in which Pandas may come in handy:

Easily delete and add columns from DataFrame
Convert data structures to DataFrame objects
Handle missing data and outliers
Powerful grouping and aggregation functionality
Offers visualization functionality to plot complex statistical visualizations on the go
The data structures in Pandas are highly compatible with most of the other libraries

# MatplotLib
Matplotlib is another SciPy stack package and a library that is tailored for the generation of simple and powerful visualizations, including:

Line plots
Scatter plots
Bar charts and Histograms
Pie charts
Stem plots
Contour plots
Quiver plots
Spectrograms

# Seaborn 
Seaborn extends the functionality of Matplotlib and that’s why it can address the two biggest issues with Matplotlib - the quality of plots and parameter defaults. Your plots with Seaborn will be more attractive, need less time to create, and will reveal more information.

# Scikit-Learn
For machine learning, one of the most heavily used packages is scikit-learn. The package is built on top of NumPy and SciPy and makes heavy use of their mathematical operations to model and test complex computational algorithms.

# Deep Learning  (Keras / TensorFlow)
TensorFlow is an open-source library of data flow graph computations, which are fine-tuned for heavy duty Machine Learning. TensorFlow was designed to meet the performance requirements of Google for training Deep Neural Networks in order to analyze visual and textual data. The key feature of TensorFlow is its multi-layered nodes system that enables quick training of artificial neural networks on big data. This is the library that powers Google’s voice recognition and object recognition in real time.
Keras is an open-source library for building Neural Networks with a high-level of interface abstraction.

# Statistical Methods in Pandas

df.info(), .describe(), .mean(), .quantile(),
.mode() -- the mode of the column
.count() -- the count of the total number of entries in a column
.std() -- the standard deviation for the column
.var() -- the variance for the column
.sum() -- the sum of all values in the column
.cumsum() -- the cumulative sum, where each cell contains the sum of the values at all lower indices plus its own value.
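
A short sketch of a few of these methods on a tiny Series, assuming made-up values (the name `scores` is illustrative):

```python
import pandas as pd

scores = pd.Series([1, 2, 2, 5])  # illustrative values

print(scores.mean())              # average of the values
print(scores.mode().tolist())     # most frequent value(s)
print(scores.count())             # number of non-missing entries

# cumsum: each position holds the sum of all values up to and including it
running_total = scores.cumsum()
print(running_total.tolist())
```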

# Summary Statistics for Categorical Columns

These methods are extremely useful when dealing with categorical data!

.unique() shows us all the unique values contained in the column.

.value_counts() shows us a count for how many times each unique value is present in a dataset, giving us a feel for the distribution of values in the column.

Sometimes, we'll need to make changes to our dataset, or to compute functions on our data that aren't built in to Pandas. We can do this by passing lambda functions into the .apply() method when working with Pandas Series, and the .applymap() method when working with Pandas DataFrames (since pandas 2.1, .applymap() is deprecated in favor of DataFrame.map()).

# Pandas DataFrame

Students will learn how to create a DataFrame, how to view DataFrame contents, and many pandas methods to help analyze and modify data within a DataFrame. Topics covered in this lesson include: read_csv, .index, .info, .describe, .dtypes, .values, .head, .tail, .shape, scatterplot, distplot, .concat, filtering, adding columns, aggregating methods including .mean, .min, .max, .value_counts, and sort_values.

# Apply lambda functions
df['Review_Word_Length'] = df['text'].map(lambda x: len(x.split()))
df.head()
df.shape
# Group data
df.groupby('business_id')['stars'].mean().head()
# Check for duplicates
df.duplicated().value_counts()
#Use keep=False to keep all duplicates and sort_values to put duplicates next to each other
df[df.duplicated(keep=False)].sort_values(by='business_id')
# Remove duplicates
df = df.drop_duplicates()
df.shape
# Recheck for duplicates
df.duplicated().value_counts()
#Duplicates should no longer exist
df[df.duplicated(keep=False)].sort_values(by='business_id')

# Detecting missing data
df.isna()
df.isna().sum()

# Categorical data
df['Embarked'].unique()

# Strategies for dealing with missing data
We have three options for dealing with missing values -- removing them from the dataset, keeping them, or replacing them with another value.
## Replacing continuous data
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

Beginning with an exploratory data analysis (EDA), the data is inspected and pandas methods are used to begin cleaning the data for analysis. Topics included in this lesson are: .info, .describe, .value_counts, .map, .apply, .isna, .unique, lambda functions, handling missing data, dropping missing data, .applymap, and faster numpy methods including np.where and np.select.


# Using .groupby()
df.groupby('Sex').sum()

Some of the most common aggregate methods you may want to use are:

.min(): returns the minimum value for each column by group
.max(): returns the maximum value for each column by group
.mean(): returns the average value for each column by group
.median(): returns the median value for each column by group
.count(): returns the count of each column by group

# Multiple groups
df.groupby(['Sex', 'Pclass']).mean(numeric_only=True)

# Selecting information from grouped objects
df.groupby(['Sex', 'Pclass'])['Survived'].mean()

# Combining DataFrames With Pandas
to_concat = [df1, df2, df3]
big_df = pd.concat(to_concat)

some_dataframe.set_index('name_of_index_column', inplace=True)

If inplace is not specified it will default to False, meaning that a copy of the DataFrame with the requested changes will be returned, but the original object will remain unchanged.
 
An Outer Join returns all records from both tables
An Inner Join returns only the records with matching keys in both tables
A Left Join returns all the records from the left table, as well as any records from the right table that have a matching key with a record from the left table
A Right Join returns all the records from the right table, as well as any records from the left table that have a matching key with a record from the right table

joined_df = df1.join(df2, how='inner')

If how= is not specified, it defaults to 'left'.
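
The four join types can be compared on two tiny DataFrames. This is a sketch with made-up data (the names `left` and `right` and their contents are assumptions); .join() matches rows on the index:

```python
import pandas as pd

# Two small tables that share only the key "bob"
left = pd.DataFrame({"pay": [10, 20, 30]}, index=["ann", "bob", "cy"])
right = pd.DataFrame({"team": ["x", "y"]}, index=["bob", "dee"])

inner = left.join(right, how="inner")       # keys found in both tables
outer = left.join(right, how="outer")       # keys found in either table
default = left.join(right)                  # how defaults to 'left'
right_join = left.join(right, how="right")  # every key of the right table

print(sorted(outer.index))
```

Rows without a match on the other side are kept in outer, left, and right joins, with missing values (NaN) filling the columns that have no data.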

# Pivot tables
The .pivot() method creates pivot tables with pandas.

The .merge() method combines multiple DataFrames that have columns in common.
 
# Lambda functions
They're very useful for transforming a column feature. For example, you might want to extract the day from a date.

import pandas as pd
dates = pd.Series(['12-01-2017', '12-02-2017', '12-03-2017', '12-04-2017'])
dates.map(lambda x: x.split('-')[1])

# Combining DataFrames
You can combine dataframes by merging them (joining data by a common field) or concatenating them (appending data at the beginning or end).

df1 = pd.DataFrame(dates)
df2 = pd.DataFrame(['12-05-2017', '12-06-2017', '12-07-2017'])
pd.concat([df1, df2])

# Grouping and aggregating
df = pd.read_csv('titanic.csv')
df.head()
grouped = df.groupby(['Pclass', 'Sex'])['Age'].mean().reset_index()
grouped.head()

# Pivot tables
pivoted = grouped.pivot(index='Pclass', columns = 'Sex', values='Age')
pivoted

# SQL Structured Query Language

SQL databases are containers that can contain multiple tables. Each table has a schema that describes its columns (and their data types), and an entity relationship diagram (or ERD) is a visual representation of all tables and their relationships.

Unlike a CSV, a SQL table can also enforce the data types of the columns, which are described in the table schema. The schema for this table might look something like this:

CREATE TABLE people (
  id INTEGER PRIMARY KEY,
  name TEXT,
  age INTEGER,
  email TEXT
);

A simple query would look something like this:

SELECT col1, col2, col3
FROM table
WHERE records match criteria
LIMIT 100;
Don't worry if this is confusing now; you'll soon learn what each line does. For now, just notice that:

Queries start with the SELECT clause, followed by what you want to select. If selecting multiple columns, you separate them with a comma.
Then you specify where that data is being retrieved from using the FROM clause followed by the table name.
Afterward, you can provide conditions such as filters or limits on the amount of data returned.

A primary key is a unique identifier for a table. You'll see that the columns that are the primary key for one table can also appear on other tables. This is known as a foreign key aka the primary key from a different ("foreign") table. 

Using the sqlite3 module, the following opens a connection to the database customers:
  sqlite3.connect('customers')
  
After SELECT and FROM, the next SQL clause you're most likely to use as a data scientist is WHERE.

With just a SELECT expression, we can specify which columns we want to select, as well as transform the column values using aliases, built-in functions, and other expressions.

However if we want to filter the rows that we want to select, we also need to include a WHERE clause.

Order the results of your queries by using ORDER BY (ASC & DESC)
Limit the number of records returned by a query using LIMIT

Describe the relationship between aggregate functions and GROUP BY statements
Use GROUP BY statements in SQL to apply aggregate functions like: COUNT, MAX, MIN, and SUM
Create an alias in a SQL query
Use the HAVING clause to compare different aggregates
Compare the difference between the WHERE and HAVING clause

For a restaurant with a database table named orders with a numeric column price, the following query returns the 3 highest priced orders:
  SELECT * FROM orders ORDER BY price DESC LIMIT 3;

The following query is invalid for two tables employees and consumers that each have a person_id column, because person_id is ambiguous unless it is prefixed with a table name:
  SELECT person_id FROM employees, consumers;
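
A sketch of what happens when that query is run, assuming two empty in-memory SQLite tables created only for this demonstration. A column name shared by both tables cannot be selected without saying which table it comes from:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE employees (person_id INTEGER)")
conn.execute("CREATE TABLE consumers (person_id INTEGER)")

try:
    conn.execute("SELECT person_id FROM employees, consumers;")
    error_message = None
except sqlite3.OperationalError as exc:
    # SQLite rejects the query because person_id exists in both tables
    error_message = str(exc)
print(error_message)

# Prefixing the column with its table name removes the ambiguity
rows = conn.execute(
    "SELECT employees.person_id FROM employees, consumers;"
).fetchall()
conn.close()
```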
  
You write SQL queries that start with SELECT. This allows you to read specific columns by name, or read all columns using SELECT *.

If you want to filter your selection so it only contains rows that meet certain criteria, you can use the WHERE clause.

If you want to order your selection, you can use the ORDER BY clause. Remember that the default behavior is to return results in ascending order (smallest to largest). You can verbosely specify this with the ASC keyword, or, more commonly, modify the behavior to sort in descending order (largest to smallest) with the DESC keyword.

If you want to limit the number of results, you can use the LIMIT clause. This is frequently used with ORDER BY to select the largest or smallest N results.

You can group data using the GROUP BY clause.

Often you also want to use an aggregate function when grouping, so that you can summarize the grouped data in some meaningful way. Examples of these aggregate functions include COUNT(), MAX(), MIN(), and AVG(). You can also use those aggregate functions without grouping, and they will return a single record for the table overall (or all rows selected based on your filters).

If you want to filter based on the result of an aggregate function, you need to use HAVING rather than WHERE. WHERE applies to the original rows of the table, whereas HAVING applies to the rows created by GROUP BY. Both can be used in the same query if needed, applied to different features.
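
The clauses above can be tried together on a small in-memory table. This is a sketch with invented rows (the customer column and all values are assumptions, loosely following the orders table mentioned earlier):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, price REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 5.0), ("ann", 20.0), ("bob", 8.0), ("bob", 3.0), ("cy", 50.0)],
)

# ORDER BY ... DESC combined with LIMIT returns the N largest rows
top_two = conn.execute(
    "SELECT customer, price FROM orders ORDER BY price DESC LIMIT 2;"
).fetchall()

# WHERE filters the original rows; HAVING filters the groups made by GROUP BY
big_spenders = conn.execute(
    """
    SELECT customer, SUM(price) AS total
    FROM orders
    WHERE price > 4
    GROUP BY customer
    HAVING SUM(price) > 10
    ORDER BY total DESC;
    """
).fetchall()
conn.close()

print(top_two)
print(big_spenders)
```

Here bob's 3.0 order is dropped by WHERE before grouping, and bob's remaining group total of 8.0 is then dropped by HAVING.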

# SQL JOIN
The SQL JOIN clause is the main way that you will write queries that combine data from multiple tables.
import pandas as pd
import sqlite3

conn = sqlite3.connect("payroll.db")
pd.read_sql("""SELECT * FROM employees;""", conn)
pd.read_sql("""SELECT name FROM managers WHERE id = 1;""", conn) 
pd.read_sql("""SELECT name FROM managers WHERE id = 2;""", conn) 
With a SQL join, we can do it all at once:

q = """
SELECT *
FROM employees
JOIN managers
    ON employees.manager_id = managers.id
;
"""
pd.read_sql(q, conn)

Most of the time when you have a JOIN, you want to specify which columns you actually want, instead of SELECT *. Something like this, using aliases to make everything really clear:

q = """
SELECT
    employees.name AS employee_name,
    employees.pay AS employee_pay,
    managers.name AS manager_name
FROM employees
JOIN managers
    ON employees.manager_id = managers.id
;
"""
pd.read_sql(q, conn) 

# SQL Subqueries
Another more-advanced technique we will introduce in this section is a SQL subquery. The above query, rewritten to use a subquery instead of JOIN, would be:

q = """
SELECT
    name AS employee_name,
    pay AS employee_pay,
    (
        SELECT name
        FROM managers
        WHERE managers.id = employees.manager_id
    ) AS manager_name
FROM employees
;
"""
pd.read_sql(q, conn) 

conn.close() 
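
To check that the JOIN and the subquery return the same rows without needing payroll.db, here is a self-contained sketch. The in-memory tables and the names in them are invented for illustration; only the column layout follows the queries above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE managers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        pay INTEGER,
        manager_id INTEGER REFERENCES managers (id)  -- foreign key
    );
    INSERT INTO managers VALUES (1, 'Dana'), (2, 'Eli');
    INSERT INTO employees VALUES
        (1, 'Ann', 50, 1), (2, 'Bob', 60, 2), (3, 'Cy', 55, 1);
    """
)

join_rows = conn.execute(
    """
    SELECT employees.name, employees.pay, managers.name
    FROM employees
    JOIN managers ON employees.manager_id = managers.id
    ORDER BY employees.id;
    """
).fetchall()

subquery_rows = conn.execute(
    """
    SELECT name, pay,
           (SELECT name FROM managers WHERE managers.id = employees.manager_id)
    FROM employees
    ORDER BY employees.id;
    """
).fetchall()
conn.close()

print(join_rows == subquery_rows)
```

The two forms agree here because every employee has exactly one matching manager. An employee with no manager would be dropped by the inner JOIN but kept (with a NULL manager name) by the subquery version.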

# APIs - Application Programming Interfaces
An API is a communication protocol between two software systems. It describes the mechanism through which one system requests some information using a predefined format and a remote system responds with an outcome that gets sent back to the first system. APIs are a way of allowing two applications to interact with each other.

An API has three main components as listed below:

Access Permissions: Is the user allowed to ask for data or services?
Request: The service being asked for (e.g., if I give you current location using GPS, tell me the map around that place - as we see in Pokemon Go). A Request has two main parts:
  Methods: Once the access is permitted, what questions can be asked.
  Parameters: Additional details that can be sent with requests or responses.
Response: The data or service as a result of the request.
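
The components listed above map onto the pieces of an HTTP request. The sketch below only builds a request object and does not send it; the endpoint URL, parameter names, and token are all hypothetical:

```python
from urllib.parse import urlencode
from urllib.request import Request

base_url = "https://api.example.com/v1/maps"  # hypothetical endpoint
params = {"lat": "40.7", "lon": "-74.0"}      # parameters: extra request details

req = Request(
    f"{base_url}?{urlencode(params)}",
    headers={"Authorization": "Bearer HYPOTHETICAL_TOKEN"},  # access permissions
    method="GET",                                            # method: what is asked
)

print(req.get_method(), req.full_url)
```

Sending the request (for example with urllib.request.urlopen) would produce the third component, the response.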

A client is a computer hardware device or software that requests a service made available by a server. The server is often (but not always) located on a separate physical computer.
A server is a physical computer dedicated to run services to serve the needs of clients. Depending on the service that is running, it could be a file server, database server, home media server, print server, email server or a web server.
The idea of a Client and Server communicating over a network is what makes viewing websites and interacting with Web applications (like Gmail, Facebook, LinkedIn) possible. This model is a way to describe the give-and-take relationship between the client and server in a Web application and governs how information passes between computers.
A Web application (Web app) is an application program that is stored on a remote server and delivered over the Internet through a browser interface

# The Web client
The client is what the end user interacts with. "Client-side" code is responsible for most of what a user actually sees. For requesting some information as a web page, the client side may be responsible for:

Defining the structure of the Web page
Setting the look and feel of the Web page
Implementing a mechanism for responding to user interactions (clicking buttons, entering text, etc.)
Most of these tasks are managed by HTML/CSS/JavaScript-like technologies to structure the information, style of the page and provide interactive objects for navigation and focus. 

# The Web Server
A web server in a Web application is what listens to requests coming in from the clients. When you set up an HTTP (HyperText Transfer Protocol, the language of the web) server, you set it up to listen on a port number. A port number is always associated with the IP address of a computer. You can think of ports as separate channels on a computer that we can use to perform different tasks: one port could be surfing www.facebook.com while another fetches your email.
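
A minimal sketch of a server listening on a port and a client calling it, using only the standard library. The handler name and response text are invented, and port 0 is used so the operating system picks any free port:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello from the server"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


# Bind to the local machine; port 0 asks the OS for a free port
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client connects to the same IP address and port
client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
client.request("GET", "/")
response = client.getresponse()
status = response.status
text = response.read().decode()
client.close()

server.shutdown()
server.server_close()
print(port, status, text)
```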

# The Database
Databases are the foundations of Web architecture. An SQL/NoSQL or a similar type of database is a place to store information so that it can easily be accessed, managed, and updated. If you're building a social media site, for example, you might use a database to store information about your users, posts, comments, etc. When a visitor requests a page, the data inserted into the page comes from the site's database, allowing real-time user interactions with sites like Facebook or apps like Gmail.



# OAuth
OAuth stands for Open Authorization.

OAuth is an open standard created to allow the creators of APIs and other online services to easily share private data or assets with users. One of the biggest challenges of building multi-user applications is making sure that you only give people access to the data and functionality they're supposed to have. OAuth provides a framework for allowing authenticated access, but without the risk of having to share the original login credentials such as a password.

It allows applications to have limited scopes to user data.

The folium package creates interactive maps.



# The Steps of OAuth
Prior to using OAuth, we must also register our application with the authorizer and get our credentials to use during the process. We need to set up some information about the application, like the app's name or website, and most importantly, a redirect URI. The authorizer later uses this to contact the requesting app and tell them that the user said yes.

A URI (Uniform Resource Identifier) is a string that refers to a resource. The most common are URLs, which identify the resource by giving its location on the Web.

After registration, the first step is the authorization. Here, we send our users to the authorization server to ask for the permissions (our scope) that we would like to have. The user can see everything being requested on their behalf and confirm that they would like to grant our application access for those permissions.

The second step is the redirect. Redirect URIs are a critical part of the OAuth flow. After a user successfully authorizes their application, the authorization server then redirects the user back to the app with an authorization code in the URL. Because the redirect URL will contain sensitive information, it is critical that the service doesn’t redirect the user to arbitrary locations. The authorization code is used by our application in the final act of getting the access token.

The final step is acquisition. This is where we finally receive our access token from the service provider so we can process API requests for our user. We use the authorization code we received in the redirect to our redirect URI and our own application secret (which is acquired during initial registration) in order to get our user’s access token. The access token can then be used to make API calls on behalf of our user.

If you've ever used your Facebook or Google account to log in to a 3rd party website or app, then you've used OAuth. OAuth is what makes this sort of Single-Sign-On (SSO) ability possible.
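
The three steps can be sketched without contacting any real service. Everything named below is a placeholder: the URLs, client id, secret, and scope are hypothetical, and the parameter names follow the common OAuth 2.0 authorization-code flow, which a specific provider may vary. The `state` value is an extra safeguard not discussed above:

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

# Values obtained when registering the app (all hypothetical)
AUTHORIZE_URL = "https://auth.example.com/authorize"
CLIENT_ID = "hypothetical-client-id"
CLIENT_SECRET = "hypothetical-client-secret"
REDIRECT_URI = "https://myapp.example.com/callback"

# Step 1 (authorization): the link we send the user to
state = secrets.token_urlsafe(16)  # random value to recognize our own redirect
authorization_link = AUTHORIZE_URL + "?" + urlencode({
    "response_type": "code",
    "client_id": CLIENT_ID,
    "redirect_uri": REDIRECT_URI,
    "scope": "read_profile",
    "state": state,
})

# Step 2 (redirect): simulated URL the authorization server sends the user back to
redirected_to = f"{REDIRECT_URI}?code=HYPOTHETICAL_CODE&state={state}"
query = parse_qs(urlparse(redirected_to).query)
if query["state"][0] != state:
    raise ValueError("state mismatch: redirect did not come from our request")
authorization_code = query["code"][0]

# Step 3 (acquisition): fields the app would POST to the token endpoint
token_request = {
    "grant_type": "authorization_code",
    "code": authorization_code,
    "redirect_uri": REDIRECT_URI,
    "client_id": CLIENT_ID,
    "client_secret": CLIENT_SECRET,
}
print(authorization_link)
```

The user's password never appears anywhere in this exchange; the application only ever handles the authorization code and, after step 3, the access token.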

While there are many other kinds of APIs, as a data scientist, you'll typically be working with web APIs. The requests library in Python is a great starting point for making HTTP requests to APIs. Most APIs you encounter will return results in JSON format. The two most common HTTP methods you'll use when accessing APIs are GET (to retrieve information) and POST (to send information).

# HTML, CSS and Web Scraping
HTML stands for HyperText Markup Language - the "language of the web". You'll start by learning HTML syntax and practice exploring HTML documents. After that, you'll look into the process for handling new HTML elements that you might not have encountered before.

CSS or Cascading Style Sheets is how you make web pages look snazzy. You'll see more about how proper web development workflows separate content from presentation.

You'll start by learning a bit of HTML and CSS, the foundations for the web, and from there you'll take a look at how to scrape information from the web in order to systematically create and build datasets that may not be otherwise available to you.

# HTML Introduction
HTML, or HyperText Markup Language, is a markup language that describes the structure and semantic meaning of web pages. Web browsers, such as Mozilla Firefox, Internet Explorer, and Google Chrome interpret the HTML code and use it to render output. Unlike Python, JavaScript and other programming languages, markup languages like HTML don't have any logic behind them. Instead, they simply surround the content to convey structure and meaning.

Every web page you've ever visited is structured using HTML code. Being able to read and understand an HTML document is an incredibly useful tool in a data scientist's toolbox.

HTML makes use of tags which are interpreted by web browsers to affect how content is displayed. The p tag to define a paragraph is shown below:

<p>Hello World</p>

You can also alter any number of attributes inside of the opening tags. For example, the a element, which is used for links, has an href attribute to specify the destination address of the link. If you wanted to link to www.flatironschool.com, you could do so as follows:

<a href="https://www.flatironschool.com">Flatiron School</a>

You can also nest elements inside of each other. To have a link displayed as a separate paragraph, we could nest an a element inside of a p.

<p><a href="https://www.flatironschool.com">This link will be a part of a separate paragraph.</a></p>

## Basic HTML Document Structure
To use HTML5, the current up-to-date version, you can simply declare <!DOCTYPE html>.

Next, you add an opening and closing html tag. This tells the web browser to interpret everything inside the tags as HTML code.

Every HTML page is made up of two primary sections: a head and a body. The head element contains metadata about the HTML document and other information for the browser, while the body element contains the actual content.

<head>
</head>

<body>
    <!-- content of our page will be here! -->

</body>

## Comments
Let's also take a brief moment to recognize how to add comments into an HTML document. These won't get rendered to the browser at all: they're just helpful notes for the author.

<!-- This is an HTML comment; the browser will not render it -->
<h1>Top 5 Pizza Places in NYC</h1>

## Headers
HTML gives us access to different header elements, ranging from h1 to h6, with h1 being the largest and h6 being the smallest.

<h1>Dogs!</h1>
<h2>Why Dogs are Great</h2>
<h3>Different Breeds</h3>

## Images
You can embed images on our web pages using the img element. The img element doesn't have a closing tag. The src attribute tells the browser where to find the image. The alt attribute will be displayed if an image can't be loaded, and also describes the image to search engines.

<img src="..." alt="Picture of a Dog">

## Lists
Some other useful HTML elements are lists. You can make bulleted, or unordered lists, using opening and closing ul tags. Inside, you can nest a li, or "list item" element for each item on our list.

<h1>My Favorite Things in No Particular Order</h1>
<ul>
    <li>Coffee</li>
    <li>Vinyl Records</li>
    <li>Pickling</li>
</ul>

You can also make a numbered, or ordered list, using an ol tag.

<h1>Top 5 Pizza Places in NYC</h1>
<ol>
    <li>DiFara Pizza</li>
    <li>Lucali's</li>
    <li>Sal and Carmine's</li>
    <li>Juliana's</li>
    <li>Joe's</li>
</ol>

In the HTML world, the Mozilla Developer Network (MDN) is an extremely trustworthy site.

# Cascading Style Sheets or CSS!
HTML lets you mark up your content with semantic structure. It forms the skeleton of your web page. HTML authors believe that creating marked-up documents and styling marked-up documents are entirely separate tasks. They see a difference between writing content (the data within the HTML document) and specifying presentation, the rules for displaying the rendered elements.

CSS lets you write rules that define how browsers will present HTML. CSS is the language for styling web pages. The focus is on the aesthetic quality of the page.

For each presentation rule, there are 3 things to keep in mind:

What is the specific HTML we want to style?
What are the qualities we want to modify (e.g. the properties of text in a paragraph)?
How do we want to modify the qualities of the element (e.g. font family, font color, font size, line height, letter spacing, etc.)?

Once you've decided what to modify and how, we can start writing CSS rules.

CSS selectors are a way of declaring which HTML elements you wish to style. Selectors can appear a few different ways:

The type of HTML element (h1, p, div, etc.)
The value of an element's id or class (id="idvalue", class="classname")
The value of an element's attributes (value="hello")
The element's relationship with surrounding elements (a p within an element with class of .infobox)

For example, if you want the body of the page to have a black background, your selector syntax may be html or body. For anchors, your selector would be a. A few more examples are listed below:

/* The CSS comment syntax is text between "slash-star" and "star-slash" */

/* selects all anchor tag elements in the document (e.g. <a>Page Link</a>) */
a

/* selects all headers of type h3 in the document (e.g. <h3>Type selectors</h3>) */
h3

/* selects all paragraph elements in the document (e.g. <p>Type selectors are used to...</p>) */
p

The class selector is a commonly used selector. Class selectors are used to select all elements that share a given class name. The class selector syntax is: .classname. Prefix the class name with a '.' (period).

/* selects all elements that have the 'important-topic' classname (e.g. any two elements that both carry class="important-topic") */
.important-topic

/* selects all elements that have the 'helpful-hint' classname (e.g. any two elements that both carry class="helpful-hint") */
.helpful-hint

You can also use the id selector to style elements. However, there should be only one element with a given id in an HTML document. This can make styling with the ID selector ideal for one-off styles. The id selector syntax is: #idvalue. Prefix the id attribute of an element with a # (which is called "octothorpe," "pound sign", or "hashtag").

/* selects the HTML element with the id 'main-header' (e.g. an element with id="main-header") */
#main-header

/* selects the HTML element with the id 'welcome-message' (e.g. an element with id="welcome-message") */
#welcome-message

# Declare CSS Properties and Values
Each element has a list of qualities that can be styled. CSS "property" names identify those qualities. For text styling, examples of property names include color, text-align and line-height.

CSS Property Values are directly related to property names. If you are working with the color property, the value could be a named color such as red, or #660000. Some properties have their values set with words, others with numbers, and some can take both.

A CSS property name with a CSS property value is a CSS declaration. To apply a CSS declaration like color: blue to a specific HTML element, you need to combine your CSS declaration with a CSS selector. The association between one or more CSS declarations and a CSS selector is called a CSS declaration block. CSS declarations (one or more) that apply to a specific selector are wrapped by curly braces ({ }). Each declaration inside a declaration block must be separated by a semi-colon (;).

Below is a sample CSS declaration block.

selector {
  color: blue;
}
/*
This is a CSS declaration for a selector.
'color' is a property name and 'blue' is a CSS property value.
!!!!! CSS declarations must end with a semi-colon (;) !!!!!
*/

Here's a more complete example declaration block.

/*
The CSS declaration block below:

Will apply to all h1 elements
Will change the text color to blue
Will set the font family to Georgia
*/
h1 {
  color: blue;
  font-family: Georgia;
}

With the combination of HTML and CSS, you are able to define content, structure, and style for websites.

## HTML Basics
Hyper Text Markup Language, or HTML, is a way to demarcate a document into different parts. Each part is marked by elements (using tags). Each element has its own special connotation that the browser uses to render the HTML document.

Elements
All begin with < and end with >, e.g., <div> (this last part is a tag).
Most have an opening tag such as <div> and a closing tag </div>. The / indicates to the browser that that tag is a closing tag.
The element is everything between the tags and the tags themselves.
Some tags are self-closing like the line break element <br>.
Elements can have IDs and classes to aid the browser in finding specific tags.
ID and class names must begin with a letter A-Z or a-z, which can be followed by letters (A-Za-z), digits (0-9), hyphens (-), and underscores (_).
IDs can only be used once per page, e.g.: <div id="some-id"></div>.
Classes can be used as many times as you want, e.g.: <div class="some-class"></div>.
Elements nested inside other elements are called children. Children inherit attributes from their parents. Don't nest everything.
Elements next to one another are siblings. Siblings do not inherit from one another but are important for selecting in CSS.

## CSS Basics
Cascading Style Sheets, or CSS, is a language created to style an HTML document by telling the browser how specific elements should look. CSS does this by selecting elements based on their tag, ids, classes, or all of the above. The reason for CSS is the separation of concerns. You want HTML only to be concerned with how it displays and demarcates information, and we let CSS worry about how to make that information look pretty.

CSS selectors
They select elements to assign them styles.

* (wildcard) selects every element.
An element, such as div, will select all elements of that type.
An id is selected like this: #some-id
Classes are selected like this: .some-class
To select all children elements of a parent do something like this: div p
To select multiple different elements separate them by commas like this: div, p, a

Here's an example of CSS styling:

* {
  color: red; /* color in CSS refers to font color */
}
/* all elements will have red font */


# The components of a web page
When we visit a web page, our web browser makes a GET request to a web server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

HTML: contains the main content of the page.
CSS: adds styling to make the page look nicer.
JS: Javascript files add interactivity to web pages.
Images: image formats, such as JPG and PNG, allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping.

## HTML
HyperText Markup Language (HTML) is the language that web pages are written in. HTML isn’t a programming language like Python. Instead, it’s a markup language that tells a browser how to lay out content.

Let’s take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags. The most basic tag is the <html> tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:

<html>
</html>

Right inside an html tag, we put two other tags, the head tag, and the body tag. The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn’t useful in web scraping: 

<html>
    <head>
    </head>
    <body>
    </body>
</html>

We’ll now add our first content to the page, in the form of the p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:

<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
</html>

Here's a paragraph of text!

Here's a second paragraph of text!


Tags have commonly used names that depend on their position in relation to other tags:

child: a child is a tag inside another tag. So the two p tags above are both children of the body tag.
parent: a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.
sibling: a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. Both p tags are siblings, since they’re both inside body.

We can also add properties to HTML tags that change their behavior:

<html>
  <head></head>
  <body>
    <p>
      Here's a paragraph of text!
      <a href="https://www.dataquest.io">Learn Data Science Online</a>
    </p>
    <p>
      Here's a second paragraph of text!
      <a href="https://www.python.org">Python</a>        
    </p>
  </body>
</html>


In the above example, we added two a tags. a tags are links, and tell the browser to render a link to another web page. The href property of the tag determines where the link goes.
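
To make the role of the href property concrete, here is a small sketch that pulls the link targets out of the example document. It is an illustration under assumptions: it uses the standard library's `html.parser` module instead of the scraping tools these notes cover later, and the class name `LinkCollector` is hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every a tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs holding the tag's properties
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


html_doc = """
<html>
  <head></head>
  <body>
    <p>
      Here's a paragraph of text!
      <a href="https://www.dataquest.io">Learn Data Science Online</a>
    </p>
    <p>
      Here's a second paragraph of text!
      <a href="https://www.python.org">Python</a>
    </p>
  </body>
</html>
"""

collector = LinkCollector()
collector.feed(html_doc)
print(collector.links)
```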

a and p are extremely common html tags. Here are a few others:

div: indicates a division, or area, of the page.
b: bolds any text inside.
i: italicizes any text inside.
u: underlines any text inside.
table: creates a table.
form: creates an input form.


There are two special properties that give HTML elements names, and make them easier to interact with when we’re scraping: class and id.

One element can have multiple classes, and a class can be shared between elements. Each element can only have one id, and an id can only be used once on a page. Classes and ids are optional, and not all elements will have them. We can add classes and ids to our example:

<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
            <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org" class="extra-large">Python</a>
        </p>
    </body>
</html>


# Webscraping with Python

import requests 
from bs4 import BeautifulSoup 
import pandas as pd

## The requests library
The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library (similar to interacting with APIs!).

req = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html") 

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully:

req.status_code 

We can print out the HTML content of the page using the content property:

req.content 
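
In a script, it can help to check the status before using the content. The sketch below is one possible way to do that, not the only one: the function name `download_page` and the 10-second timeout are assumptions, while `raise_for_status` and `text` are part of the requests Response object.

```python
import requests


def download_page(url, timeout=10):
    """Return the page's HTML as text, or raise if the download failed.

    The helper name and the timeout value are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    # .text is the decoded string form of the raw bytes in .content
    return response.text
```

Calling `download_page("http://dataquestio.github.io/web-scraping-pages/simple.html")` would return the page's HTML as a string when the request succeeds.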

## Parsing a page with BeautifulSoup
We can use the BeautifulSoup library to parse this document, and extract the text from the <p> tag.

soup = BeautifulSoup(req.content) 
list(soup.children) 
print(soup.prettify()) 
    
As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup.

list(soup.children) 
[type(item) for item in list(soup.children)]
    
The Tag object allows us to navigate through an HTML document, and extract other tags and text. You can learn more about the various BeautifulSoup objects in the BeautifulSoup documentation.

We can now select the html tag and its children by taking the second item in the list:

html = list(soup.children)[1] 
html 
    
Each item in the list returned by the children property is also a BeautifulSoup object, so we can also use the children property on html.

Now we can find the children inside the html tag:

list(html.children) 
    
As you can see above, there are two tags here, head, and body. We want to extract the text inside the p tag, so we’ll dive into the body:

body = list(html.children)[3]
    
We can now isolate the p tag:

p = list(body.children)[1] 
    
Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside the tag:

p.get_text()
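
The positions used above (such as `[1]` and `[3]`) depend on the whitespace strings that sit between tags, and those can differ between parsers. As a hedged alternative, the sketch below keeps only Tag objects at each level. The helper `tag_children` is hypothetical, the inline HTML is a stand-in so the example runs without a download, and the parser is set explicitly to `html.parser`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Stand-in document, used here so the example runs offline.
html_doc = """<!DOCTYPE html>
<html>
    <head><title>A simple example page</title></head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")


def tag_children(node):
    """Return only the Tag children, skipping doctype and whitespace strings."""
    return [child for child in node.children if isinstance(child, Tag)]


html = tag_children(soup)[0]     # the html tag
head, body = tag_children(html)  # its two child tags
p = tag_children(body)[0]        # the paragraph inside body

print(p.get_text())
```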
    
## Finding all instances of a tag at once 
    
Moving through the page one level at a time takes many commands for a simple task. To go straight to a tag, we can instead use the find_all method, which finds all the instances of a tag on a page.

soup = BeautifulSoup(req.content) 
soup.find_all('p') 

Note that find_all returns a list, so we’ll have to loop through it, or use list indexing, to extract text:

soup.find_all('p')[0].get_text() 
    
If you instead only want to find the first instance of a tag, you can use the find method, which will return a single BeautifulSoup object:

soup.find('p') 
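
The difference between the two methods shows up most clearly when a tag is missing. A minimal sketch, using a made-up two-paragraph snippet so that it runs offline:

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration only.
html_doc = "<body><p>First</p><p>Second</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns a list-like result, so loop over it to get the text
texts = [p.get_text() for p in soup.find_all("p")]

# find returns only the first match
first = soup.find("p")

# When nothing matches, find returns None and find_all returns an empty list
missing_one = soup.find("table")
missing_all = soup.find_all("table")

print(texts, first.get_text(), missing_one, len(missing_all))
```

Because find can return None, checking its result before calling get_text avoids an AttributeError on pages that lack the tag.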
    
## Searching for tags by class and id
    
We introduced classes and ids earlier, but it probably wasn’t clear why they were useful. Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to pick out the specific elements we want to scrape. Let's look at the following page:
    
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
        <p class="outer-text first-item" id="second">
            <b>
                First outer paragraph.
            </b>
        </p>
        <p class="outer-text">
            <b>
                Second outer paragraph.
            </b>
        </p>
    </body>
</html>

First paragraph.

Second paragraph.

First outer paragraph.

Second outer paragraph.

page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html") 
soup = BeautifulSoup(page.content)
soup 
    
Now, we can use the find_all method to search for items by class or by id. In the below example, we’ll search for any p tag that has the class outer-text:

soup.find_all('p', class_='outer-text') 
    
In the below example, we’ll look for any tag that has the class outer-text:

soup.find_all(class_="outer-text")[0]

We can also search for elements by id:

soup.find_all(id="first") 
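
The searches above can be reproduced offline by parsing the example page from a string. This is a sketch under that assumption: the markup is copied from the listing in this section instead of being downloaded.

```python
from bs4 import BeautifulSoup

# The example page from this section, inlined so no download is needed.
html_doc = """
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">First paragraph.</p>
            <p class="inner-text">Second paragraph.</p>
        </div>
        <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
        <p class="outer-text"><b>Second outer paragraph.</b></p>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

outer = soup.find_all("p", class_="outer-text")    # p tags with this class
first_items = soup.find_all(class_="first-item")   # any tag with this class
first = soup.find(id="first")                      # ids are unique, so find is enough

# An element can carry several classes, so the class property is a list
print(first["class"])
print([p.get_text(strip=True) for p in outer])
```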
    
    
## More sophisticated webpages 
    
url = 'https://forecast.weather.gov/MapClick.php?lat=41.8843&lon=-87.6324#.XdPlJUVKg6g' 
request = requests.get(url) 
soup = BeautifulSoup(request.content) 
    
times = soup.find_all(class_='period-name') 
times 
    
descs = soup.find_all(class_='short-desc') 
descs 
    
together = [(time.text, desc.text) for time, desc in zip(times, descs)]
together
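
Pairing two result lists with zip silently stops at the shorter list, so a length check can catch a page whose layout has changed. The sketch below illustrates this on made-up markup: the `forecast-item` wrapper and the forecast text are hypothetical, and only the `period-name` and `short-desc` class names come from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical forecast markup, reusing the two class names searched for above.
html_doc = """
<div class="forecast-item">
  <p class="period-name">Tonight</p>
  <p class="short-desc">Mostly Clear</p>
</div>
<div class="forecast-item">
  <p class="period-name">Thursday</p>
  <p class="short-desc">Sunny</p>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
times = soup.find_all(class_="period-name")
descs = soup.find_all(class_="short-desc")

# zip would quietly drop unmatched items, so fail loudly instead
if len(times) != len(descs):
    raise ValueError("period names and descriptions do not line up")

together = [(time.get_text(), desc.get_text()) for time, desc in zip(times, descs)]
print(together)
```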
   
## Pulling in a Table
In general you'll need to examine the HTML code so that you can tell the BeautifulSoup parser what to look for!

url = 'https://www.pro-football-reference.com/'

res = requests.get(url) 
soup = BeautifulSoup(res.content) 

teams = [] 
table = soup.find('table', {'id': 'AFC'})

for row in table.find('tbody').find_all('tr'): 
    try: 
        team = {'name': row.find('th', {'data-stat': 'team'}).text, 
                'wins': row.find('td', {'data-stat': 'wins'}).text, 
                'losses': row.find('td', {'data-stat': 'losses'}).text, 
                'ties': row.find('td', {'data-stat': 'ties'}).text}
        teams.append(team) 
    except AttributeError:
        # row.find returned None: skip rows (such as dividers) missing a cell
        pass
    
teams
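
The same extraction can be written with explicit None checks instead of try/except, which also covers the case where the table itself is missing. This is a sketch on a made-up table: the team names and numbers are invented, the `parse_standings` helper and `COLUMNS` mapping are hypothetical, and only the `data-stat` names mirror the code above.

```python
from bs4 import BeautifulSoup

# Invented table that mimics the data-stat properties used above.
html_doc = """
<table id="AFC">
  <tbody>
    <tr class="thead"><td colspan="4">Example Division</td></tr>
    <tr><th data-stat="team">Team A</th><td data-stat="wins">10</td>
        <td data-stat="losses">6</td><td data-stat="ties">0</td></tr>
    <tr><th data-stat="team">Team B</th><td data-stat="wins">7</td>
        <td data-stat="losses">8</td><td data-stat="ties">1</td></tr>
  </tbody>
</table>
"""

# Output key -> data-stat value to look up in each row
COLUMNS = {"name": "team", "wins": "wins", "losses": "losses", "ties": "ties"}


def parse_standings(soup, table_id):
    """Return one dictionary per team row, or an empty list if no table is found."""
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    teams = []
    for row in table.find_all("tr"):
        cells = {key: row.find(attrs={"data-stat": stat}) for key, stat in COLUMNS.items()}
        # Divider or header rows lack these cells, so skip them
        if any(cell is None for cell in cells.values()):
            continue
        teams.append({key: cell.get_text(strip=True) for key, cell in cells.items()})
    return teams


soup = BeautifulSoup(html_doc, "html.parser")
teams = parse_standings(soup, "AFC")
print(teams)
```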
    
## Combining our data into a Pandas DataFrame 
    
We can now combine the data into a Pandas DataFrame and analyze it.

To do this, we’ll call the DataFrame class and pass in the data we collected. Given a list of dictionaries, each dictionary key becomes a column in the DataFrame and each dictionary becomes a row. Given a list of tuples, each tuple becomes a row and we name the columns ourselves:

Football data from the table, as a list of dictionaries (very easy!):
football = pd.DataFrame(teams)
football

Weather data from the list of pairs:
weather = pd.DataFrame(together, columns=['time', 'description'])
weather
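
Scraped values arrive as strings, so numeric columns usually need converting before analysis. A minimal sketch, assuming small made-up stand-ins for `teams` and `together` so that it runs without scraping:

```python
import pandas as pd

# Made-up stand-ins shaped like the scraped results above.
teams = [
    {"name": "Team A", "wins": "10", "losses": "6", "ties": "0"},
    {"name": "Team B", "wins": "7", "losses": "8", "ties": "1"},
]
together = [("Tonight", "Mostly Clear"), ("Thursday", "Sunny")]

# List of dictionaries: keys become columns, each dictionary becomes a row
football = pd.DataFrame(teams)

# Text scraped from HTML is always a string, so convert the count columns
number_columns = ["wins", "losses", "ties"]
football[number_columns] = football[number_columns].apply(pd.to_numeric)

# List of tuples: each tuple becomes a row and the columns are named explicitly
weather = pd.DataFrame(together, columns=["time", "description"])

print(football.dtypes)
print(weather)
```

After the conversion, operations such as `football["wins"].sum()` work on numbers instead of joining strings together.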