## 1 Introduction to Pandas

### 1.1 Understanding pandas


Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive and flexible data structures that make data manipulation and analysis easy, among many other things.

In this lesson, we'll learn:

- about the two core pandas types: dataframes and series
- how to select data using row and column labels
- a variety of methods for exploring data with pandas
- how to assign data using various techniques in pandas
- how to use boolean indexing with pandas for selection and assignment

We'll be working with a data set from [Fortune](http://fortune.com/) magazine's [Global 500](https://en.wikipedia.org/wiki/Fortune_Global_500) list for 2017, which ranks the top 500 corporations worldwide by revenue. The data set was originally compiled [here](https://data.world/chasewillden/fortune-500-companies-2017); we have modified the original into a more accessible format.

<img width="400" src="https://drive.google.com/uc?export=view&id=1vWtPGbbxR7Mn2xHg_KMa3MOKSqs05uyE">


The dataset is a CSV file called **f500.csv**. Here is a data dictionary for some of the columns in the CSV:

- **company** - The name of the company.
- **rank** - The Global 500 rank for the company.
- **revenues** - The company's total revenues for the fiscal year, in millions of dollars (USD).
- **revenue_change** - The percentage change in revenue between the current and prior fiscal years.
- **profits** - Net income for the fiscal year, in millions of dollars (USD).
- **ceo** - The company's Chief Executive Officer.
- **industry** - The industry in which the company operates.
- **sector** - The sector in which the company operates.
- **previous_rank** - The Global 500 rank for the company for the prior year.
- **country** - The country in which the company is headquartered.
- **hq_location** - The city and country (or city and state for the USA) where the company is headquartered.
- **employees** - Total employees (full-time equivalent, if available) at fiscal year-end.


The import convention for pandas is:

```python
import pandas as pd
```

We have already imported pandas and used the [pandas.read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to read the CSV into a pandas object and assign it to the variable name f500. In the next mission we'll learn about **read_csv()**, but for now all you need to know is that it handles reading and parsing most CSV files automatically.

Pandas objects have a **.shape** attribute which returns a tuple representing the dimensions of each axis of the object. We'll use that and Python's **type()** function to inspect the f500 pandas object.
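
As a minimal sketch of these two tools, the snippet below builds a tiny dataframe by hand instead of reading **f500.csv**. The name `demo` is hypothetical, and the values are copied from the first three rows of the data set shown later in this lesson.

```python
import pandas as pd

# A small hand-built stand-in for f500 (the name `demo` is an assumption).
demo = pd.DataFrame(
    {"rank": [1, 2, 3], "revenues": [485873, 315199, 267518]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

demo_type = type(demo)    # the class of the object
demo_shape = demo.shape   # a tuple: (number of rows, number of columns)

print(demo_type)
print(demo_shape)
```

Note that **.shape** is an attribute rather than a method, so it is written without parentheses.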

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Use Python's **type()** function to assign the type of **f500** to **f500_type.**
2. Use the **DataFrame.shape** attribute to assign the shape of **f500** to **f500_shape.**

In [0]:
import pandas as pd
f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None

# put your code here

### 1.2 Introducing DataFrames



The code we wrote in the previous screen let us know that our data has 500 rows and 16 columns, and is stored as a [pandas.core.frame.DataFrame object](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame). More commonly referred to as **pandas.DataFrame()** objects, or just **dataframes**, this type is the primary pandas data structure. Dataframes are two-dimensional pandas objects.

We'll learn about the second pandas data structure, series, later in this lesson, but first, let's look at the anatomy of a dataframe, using a selection of our Fortune 500 data:

<img width="500" src="https://drive.google.com/uc?export=view&id=1lUAxPbqauhiMPdWCAM2oPOy0vsmvWtYy">

There are three key things we can observe immediately:

- In Red: Just like a 2D ndarray, there are two axes, however each axis of a dataframe has a specific name. The first axis is called **index**, and the second axis is called **columns.**
- In Blue: Our axis values have string **labels**, not just numeric locations.
- In Green: Our dataframe contains columns with **multiple dtypes**: integer, float, and string.

We can use the [DataFrame.dtypes](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html#pandas.DataFrame.dtypes) attribute to return information about the types of each column. Let's see what this would return for our selection of data above:

```python
>>> f500_selection.dtypes

    rank          int64
    revenues      int64
    profits     float64
    country      object
    dtype: object
```

We can see three different data types (dtypes), which correspond to what we observed by looking at the data:

- int64
- float64
- object



When we import data, pandas will attempt to guess the correct dtype for each column. Generally, pandas does a pretty good job with this, which means we don't need to worry about specifying dtypes every time we start to work with data. Later in this course, we'll look at how to change the dtype of a column.
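
To illustrate the dtype guessing, here is a small sketch that parses CSV text held in memory. The inline text is a hypothetical stand-in for **f500.csv**, with values taken from the selection above. The exact name pandas reports for text columns may differ between pandas versions.

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the real f500.csv has many more rows and columns.
csv_text = """company,rank,revenues,profits,country
Walmart,1,485873,13643.0,USA
State Grid,2,315199,9571.3,China
Sinopec Group,3,267518,1257.9,China
"""

# index_col=0 uses the first column (company) as the row labels.
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Whole numbers are read as integers, decimals as floats, and text as strings.
print(demo.dtypes)
```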

Next, let's learn a few handy methods we can use to get some high-level information about our dataframe:

- If we want to view the first few rows of our dataframe, we can use the [DataFrame.head()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method, which returns the first 5 rows by default. The **DataFrame.head()** method also accepts an optional integer parameter that specifies the number of rows. We could use **f500.head(10)** to return the first 10 rows of our **f500** dataframe.
- Similar in function to **DataFrame.head()**, the [DataFrame.tail()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html) method shows us the last rows of our dataframe. It accepts an optional integer parameter to specify the number of rows, defaulting to 5.
- If we wanted to get an overview of all the dtypes used in our dataframe, along with its shape and some extra information, we could use the [DataFrame.info()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html#pandas.DataFrame.info) method. Note that **DataFrame.info()** prints the information, rather than returning it, so we can't assign it to a variable.
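
The three methods can be sketched on a made-up 20-row dataframe. The name `demo` and its values are assumptions for illustration only, not part of the Fortune data.

```python
import pandas as pd

# Made-up data: 20 rows, two integer columns.
demo = pd.DataFrame({"rank": range(1, 21), "revenues": range(2000, 0, -100)})

first_five = demo.head()     # default: first 5 rows
first_three = demo.head(3)   # first 3 rows
last_two = demo.tail(2)      # last 2 rows

# info() prints its report and returns None, so there is nothing useful to keep.
info_result = demo.info()
print(info_result)
```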

Let's practice using these three new methods. Just like in the previous missions, the f500 variable we created in the previous section is available to you here.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Using the links above to the documentation if you need to, use the three methods we just learned about to learn more about the **f500** dataframe:

  - Use the **head()** method to select the first 6 rows and assign the result to **f500_head**.
  - Use the **tail()** method to select the last 8 rows and assign the result to **f500_tail.**
  - Use the **info()** method to display information about the dataframe.


In [0]:
# put your code here

### 1.3 Selecting Columns From a DataFrame by Label



By looking at the results produced by the **DataFrame.head()** and **DataFrame.tail()** methods in the previous screen, we can see that our data set seems to be pre-sorted in order of Fortune 500 rank.

We can also see that the **DataFrame.info()** method showed us the number of entries in our index (representing the number of rows), a list of each column with their dtype and the number of non-null values, as well as a summary of the different dtypes and memory usage. In pandas, null values are represented using NaN.

Because our axes in pandas have labels, we can select data using those labels. To do this, we use the [DataFrame.loc[]](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc) method.

Throughout our pandas lessons you'll see **df** used in code examples as shorthand for a dataframe object. We use this convention because you'll also see it throughout the official pandas documentation, so getting used to reading it is important. You'll notice that we use brackets (**[]**) instead of parentheses (**()**) when selecting by label. The syntax for the **DataFrame.loc[]** method is:

```python
df.loc[row, column]
```

Where **row** and **column** refer to row and column labels respectively, and can be one of:

- A single label.
- A list or array of labels.
- A slice object with labels.
- A boolean array.

We'll look at boolean arrays later in this mission - for now, we're going to focus on the first three options. We're going to use the same selection of data we used in the previous screen, which is stored using the variable name **f500_selection** to make these examples easier.

<img width="600" src="https://drive.google.com/uc?export=view&id=1WNexstd5iVGtj04VxjcnQSqa-UTSETDg">


In each of these examples, we're going to use **:** to specify that we wish to select all rows, so we can focus on making selections using column labels only.

First, let's select a single column by specifying a single label:

<img width="600" src="https://drive.google.com/uc?export=view&id=15V82UWlysJjrA_Eg0b5vKPeMcagBhQCI">


Selecting a single column returns a pandas series. We'll talk about pandas series objects more in the next screen, but for now the important thing is to note that the new series has the same index axis labels as the original dataframe. Let's look at how we can use a list of labels to select specific columns:

<img width="600" src="https://drive.google.com/uc?export=view&id=1MSj7K1OU_0LwnKACxpYLOTfU71faWb4h">


When we use a list of labels, a dataframe is returned with only the columns specified in our list, in the order specified in our list. Just like when we used a single column label, the new dataframe has the same index axis labels as the original. Let's finish by using a **slice object with labels** to select specific columns.

<img width="600" src="https://drive.google.com/uc?export=view&id=1CEtF_oFReD_-6pqheEfEd4nRWgakEgXQ">

Again we get a dataframe object, with all of the columns from the first up until **and including** the last column in our slice. 
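
The three kinds of column selection can be sketched in code as follows. The dataframe is rebuilt by hand from the **f500_selection** values printed in this lesson, so the construction step is an assumption about how such a selection could be created.

```python
import pandas as pd

# Hand-built copy of the f500_selection data shown in this lesson.
f500_selection = pd.DataFrame(
    {
        "rank": [1, 2, 3, 4, 5],
        "revenues": [485873, 315199, 267518, 262573, 254694],
        "profits": [13643.0, 9571.3, 1257.9, 1867.5, 16899.3],
        "country": ["USA", "China", "China", "China", "Japan"],
    },
    index=["Walmart", "State Grid", "Sinopec Group",
           "China National Petroleum", "Toyota Motor"],
)

# ":" selects every row; the second argument picks the columns.
single = f500_selection.loc[:, "rank"]                # one label -> series
listed = f500_selection.loc[:, ["country", "rank"]]   # list -> dataframe, in list order
sliced = f500_selection.loc[:, "rank":"profits"]      # slice -> dataframe, end label included

print(type(single))
print(listed.columns.tolist())
print(sliced.columns.tolist())
```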

Let's practice using these techniques to select specific columns from our f500 dataframe.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Select the **industry** column, and assign the result to the variable name **industries**.
2. Select the **rank**, **previous_rank** and **years_on_global_500_list** columns, in order, and assign the result to the variable name **previous**.
3. Select all columns from **revenues** up to and including **profit_change**, in order, and assign the result to the variable name **financial_data**.

In [0]:
# put your code here

### 1.4 Column selection shortcuts



There are two shortcuts that pandas provides for accessing columns.

1. **Single Bracket** – Instead of **df.loc[:,"col_1"]** you can use **df["col_1"]** to select columns. This works for single columns and lists of columns but not for column slices.
2. **Dot Accessor** – Instead of **df.loc[:,"col_1"]** you can use **df.col_1**. This shortcut does not work for labels that contain spaces or special characters.
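
A short sketch comparing the shortcuts with **loc[]**, using a hand-built dataframe as a stand-in for the real data. It also shows one caveat not listed above: the dot accessor cannot reach a column whose label matches an existing dataframe method, such as **rank**.

```python
import pandas as pd

# Hand-built stand-in; values come from the f500_selection data in this lesson.
f500_selection = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "revenues": [485873, 315199, 267518],
        "country": ["USA", "China", "China"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

by_loc = f500_selection.loc[:, "country"]
by_bracket = f500_selection["country"]           # single-bracket shortcut
by_dot = f500_selection.country                  # dot accessor
two_cols = f500_selection[["rank", "country"]]   # list of columns -> dataframe

# All three single-column selections give the same series.
print(by_loc.equals(by_bracket) and by_bracket.equals(by_dot))

# Caveat: .rank is the DataFrame.rank() method, not the "rank" column.
print(callable(f500_selection.rank))
```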

These shortcuts are designed to make some of the more common selection tasks easier. We recommend you always use the common shorthand in your code, as it will make your code easier to read. A summary of the techniques we've learned so far is below:

<img width="600" src="https://drive.google.com/uc?export=view&id=1BlhNf3XAGs0E50GISg0-W0sZXcRowrRe">

Let's practice selecting data by column some more, this time using the common shorthand method.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Select the **country** column, and assign the result to the variable name **countries**.
2. Select the **revenues** and **years_on_global_500_list** columns, in order, and assign the result to the variable name **revenues_years**.
3. Select all columns from **ceo** up to and including **sector**, in order, and assign the result to the variable name **ceo_to_sector**.



In [0]:
# put your code here

### 1.5 Selecting Items from a Series by Label



In the last section we observed that when you select just one column of a dataframe, you get a new pandas type: a **series object**. Series is the pandas type for one-dimensional objects. Anytime you see a 1D pandas object, it will be a series, and anytime you see a 2D pandas object, it will be a dataframe.

You might like to think of a dataframe as being a collection of series objects, which is similar to how pandas stores the data behind the scenes.

<img width="600" src="https://drive.google.com/uc?export=view&id=1xNxF1TgmYOzlYj6ogofxKBiSO7FqsHOH">


To better understand the relationship between dataframe and series objects, we'll look at some examples. We'll start by looking at two pandas operations that each produce a series object:


<img width="600" src="https://drive.google.com/uc?export=view&id=1aI-saLdbP4eZOQKGjHTJT54mFCa2Vu42">

Because a series has only one axis, its axis labels are either the index axis or column axis labels, depending on whether it is representing a row or a column from the original dataframe. If we make a 2D selection from a dataframe, it will retain the labels from both axes:

<img width="600" src="https://drive.google.com/uc?export=view&id=1YyiMWxDLQ8bo4vkL1qZc5Pq49C4KNqE8">
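
The same idea in code, as a sketch on a hand-built copy of the selection data (the construction step is an assumption; only the values come from this lesson):

```python
import pandas as pd

f500_selection = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "revenues": [485873, 315199, 267518],
        "profits": [13643.0, 9571.3, 1257.9],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# A column becomes a series labelled by the index axis.
col = f500_selection["revenues"]
# A row becomes a series labelled by the column axis.
row = f500_selection.loc["Walmart"]
# A 2D selection stays a dataframe and keeps the labels of both axes.
block = f500_selection.loc[["Walmart", "State Grid"], ["rank", "revenues"]]

print(col.index.tolist())
print(row.index.tolist())
print(block.shape)
```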

Let's look at a brief summary of the differences between dataframes and series.


<img width="400" src="https://drive.google.com/uc?export=view&id=1k7GgAAgsHY8YD-Z8x7Enerozt9MlqkuR">


Just like dataframes, we can use **Series.loc[]** to select items from a series using single labels, a list, or a slice object. We can also omit **loc[]** and use bracket shortcuts for all three. Let's look at an example:

```python
>>> print(s)

a    0
b    1
c    2
d    3
e    4
dtype: int64
```

We can select a single item:

```python
print(s["d"])

3
```

Like with dataframe columns, there is a dot accessor (e.g., **s.d**) available, but it is rarely used, even less often than the dataframe dot accessor.

To select several items using a list:

```python
print(s[["a", "e", "c"]])

a    0
e    4
c    2
dtype: int64
```

And lastly, several items using a slice:

```python
print(s["a":"d"])

a    0
b    1
c    2
d    3
dtype: int64
```
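
The examples above do not show how **s** was created. The sketch below assumes it was built with **pd.Series()** and an explicit index, and shows the **Series.loc[]** equivalent of each bracket shortcut.

```python
import pandas as pd

# Assumed construction of the series `s` used in the examples above.
s = pd.Series([0, 1, 2, 3, 4], index=["a", "b", "c", "d", "e"])

single = s.loc["d"]              # same as s["d"]
listed = s.loc[["a", "e", "c"]]  # same as s[["a", "e", "c"]]
sliced = s.loc["a":"d"]          # same as s["a":"d"]; the end label is included

print(single)
print(listed.tolist())
print(sliced.tolist())
```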

Let's practice selecting data from pandas series:

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. From the pandas series **ceos**:
  -  Select the item at index label **Walmart** and assign the result to the variable name **walmart**.
  -  Select the items from index label **Apple** up to and including index label **Samsung Electronics** and assign the result to the variable name **apple_to_samsung**.
  -  Select the items with index labels **Exxon Mobil**, **BP**, and **Chevron**, in order, and assign the result to the variable name **oil_companies**.



In [0]:
ceos = f500["ceo"]

# put your code here

### 1.6 Selecting Rows From a DataFrame by Label



Now that we've learned how to select columns using the labels of the **'column'** axis, let's learn how to select rows using the labels of the **'index'** axis.

<img width="400" src="https://drive.google.com/uc?export=view&id=1HH7k0yK6abEG4xKiHgPTt5I_rz1OE5UD">

Selecting **rows** from a dataframe by label uses the same syntax as we use for **columns.** As a reminder:

```python
df.loc[row, column]
```

Where **row** and **column** refer to row and column labels. We'll look at how to select rows, again using our **f500_selection** dataframe to make these examples easier.

```python
print(type(f500_selection))
print(f500_selection)
```

```python
<class 'pandas.core.frame.DataFrame'>

                          rank  revenues  profits country
Walmart                      1    485873  13643.0     USA
State Grid                   2    315199   9571.3   China
Sinopec Group                3    267518   1257.9   China
China National Petroleum     4    262573   1867.5   China
Toyota Motor                 5    254694  16899.3   Japan
```

To select a single row:

```python
single_row = f500_selection.loc["Sinopec Group"]
print(type(single_row))
print(single_row)

<class 'pandas.core.series.Series'>

rank             3
revenues    267518
profits     1257.9
country      China
Name: Sinopec Group, dtype: object
```

As we would expect, a single row is returned as a series. We should take a moment to note that the dtype of this series is object. Because this series has to store integer, float, and string values, pandas uses the object dtype, since none of the numeric types could hold all of the values.

To select a list of rows:

```python
list_rows = f500_selection.loc[["Toyota Motor", "Walmart"]]
print(type(list_rows))
print(list_rows)

<class 'pandas.core.frame.DataFrame'>

              rank  revenues  profits country
Toyota Motor     5    254694  16899.3   Japan
Walmart          1    485873  13643.0     USA
```

For selection using slices, we can use the bracket shortcut without **loc[]**. This is the reason we can't use this shortcut for column slices: it's reserved for use with rows.

```python
slice_rows = f500_selection["State Grid":"Toyota Motor"]
print(type(slice_rows))
print(slice_rows)
```

```python
<class 'pandas.core.frame.DataFrame'>

                          rank  revenues  profits country
State Grid                   2    315199   9571.3   China
Sinopec Group                3    267518   1257.9   China
China National Petroleum     4    262573   1867.5   China
Toyota Motor                 5    254694  16899.3   Japan
```
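
Row and column labels can also be combined in a single **loc[]** call. The following sketch rebuilds **f500_selection** by hand (an assumption about its construction) and selects a single value, a list-based subset, and a slice-based subset.

```python
import pandas as pd

f500_selection = pd.DataFrame(
    {
        "rank": [1, 2, 3, 4, 5],
        "revenues": [485873, 315199, 267518, 262573, 254694],
        "profits": [13643.0, 9571.3, 1257.9, 1867.5, 16899.3],
        "country": ["USA", "China", "China", "China", "Japan"],
    },
    index=["Walmart", "State Grid", "Sinopec Group",
           "China National Petroleum", "Toyota Motor"],
)

# Single row label + single column label -> one value.
one_value = f500_selection.loc["Toyota Motor", "profits"]

# List of rows + list of columns -> dataframe in the requested order.
subset = f500_selection.loc[["Toyota Motor", "Walmart"], ["country", "rank"]]

# Slices on both axes -> dataframe; both end labels are included.
window = f500_selection.loc["State Grid":"China National Petroleum", "rank":"revenues"]

print(one_value)
print(subset)
print(window)
```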

Let's take a look at a summary of all the different label selection methods we've learned so far:


<img width="800" src="https://drive.google.com/uc?export=view&id=1rQMPkOZBVh57x6kVu5qWjU3UC6vCh9v_">


Now for some practice - we're going to make it a little bit harder this time, by asking you to combine selection methods for rows and columns on both dataframes and series!

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. By selecting data from **f500**:
  - Create a new variable, **drink_companies**, with:
    - Rows with indices **Anheuser-Busch InBev**, **Coca-Cola**, and **Heineken Holding**, in that order.
    - All columns.
  - Create a new variable **big_movers**, with:
    - Rows with indices **Aviva**, **HP**, **JD.com**, and **BHP Billiton**, in that order.
    - The **rank** and **previous_rank** columns, in that order.
  - Create a new variable, **middle_companies** with:
    - All rows with indices from **Tata Motors** to **Nationwide**, inclusive.
    - All columns from **rank** to **country**, inclusive.

In [0]:
# put your code here

### 1.7 Series and Dataframe Describe Methods



We're starting to get a feel for how axis labels in pandas make selecting data much easier. Pandas also has a large number of methods and functions that make working with data easier. Let's use a few of these to explore our Fortune 500 data.

The first method we'll learn about is the [Series.describe()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.describe.html#pandas.Series.describe) method, which returns some descriptive statistics on the data contained within a specific pandas series. Let's look at an example:

```python
revs = f500["revenues"]
print(revs.describe())
```

```python
count       500.000000
mean      55416.358000
std       45725.478963
min       21609.000000
25%       29003.000000
50%       40236.000000
75%       63926.750000
max      485873.000000
Name: revenues, dtype: float64
```

We've assigned the **revenues** column to a new series, **revs**, and then used the **describe()** method on that series. The method tells us how many non-null values are contained in the series, the mean and standard deviation, along with the minimum, maximum and [quartile](https://en.wikipedia.org/wiki/Quartile) values.

Rather than assigning the series to its own variable, we can skip that step and use the method directly on the result of the column selection. This is called **method chaining** and is a way to combine multiple methods together in a single line. It's not unique to pandas, but it is something that you see a lot in pandas code. Let's see what the command looks like with method chaining, using the **assets** column.


```python
print(f500["assets"].describe())
```

```python
count    5.000000e+02
mean     2.436323e+05
std      4.851937e+05
min      3.717000e+03
25%      3.658850e+04
50%      7.326150e+04
75%      1.805640e+05
max      3.473238e+06
Name: assets, dtype: float64
```

From here, you'll start to see method chaining used more in our missions. When writing code, you should always assess whether method chaining will make your code harder to read. It's always preferable to break out into more than one line if it will make your code easier to understand.

You might have noticed that the values in the code segment above look a little bit different. Because the values for this column are too long to display neatly, pandas has displayed them in **E-notation**, a type of [scientific notation](https://en.wikipedia.org/wiki/Scientific_notation). Here is an expansion of what the E-notation represents:

| Original Notation | Expanded Formula | Result |
|-------------------|--------------------|----------|
| 5.000000E+02 | 5.000000 * 10 ** 2 | 500 |
| 2.436323E+05 | 2.436323 * 10 ** 5 | 243632.3 |
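
The same expansion can be checked in plain Python, which accepts E-notation in float literals. This is a small sketch using the two values from the table.

```python
# E-notation literals are ordinary floats in Python.
values = [5.000000e+02, 2.436323e+05]

for v in values:
    # The "e" format code shows a float in E-notation with six decimals.
    print(f"{v:e} is the same number as {v}")

# The expanded formula from the table gives the same result.
expanded = 5.000000 * 10 ** 2
print(expanded)
```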


If we use **describe()** on a column that contains non-numeric values, we get some different statistics. Let's look at an example:

```python
print(f500["country"].describe())

count     500
unique     34
top       USA
freq      132
Name: country, dtype: object
```

Here is what the output indicates:

The first statistic, **count**, is the same as for numeric columns, showing us the number of non-null values. The other three statistics are new:

- **unique** - The number of unique values in the series. In this case, it tells us that there are 34 different countries represented in the Fortune 500.
- **top** - The most common value in the series. The USA is the most common country that a company in the Fortune 500 is headquartered in.
- **freq** - The frequency of the most common value. In this case, 132 companies in the Fortune 500 are headquartered in the USA.

Because series and dataframes are two distinct objects, they have their own unique methods. There are many cases where both series and dataframe objects have a method of the same name that behaves in a similar way. DataFrame objects also have a [DataFrame.describe()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method that returns these same statistics for every column. If you like, you can take a look at the documentation using the link in the previous sentence to familiarize yourself with some of the differences between the two methods.

One difference is that you need to specify manually if you want to see the statistics for the non-numeric columns. By default, **DataFrame.describe()** will return statistics for only numeric columns. If we wanted to get just the object columns, we need to use the **include=['O']** parameter when using the dataframe version of describe:

```python
print(f500.describe(include=['O']))
```

```python
             ceo    industry     sector  country  hq_location    website
count        500         500        500      500          500        500
unique       500          58         21       34          235        500
top     Xavie...   Banks:...  Financ...      USA  Beijing,...  http:/...
freq           1          51        118      132           56          1
```

Another difference is that **Series.describe()** returns a series object, whereas **DataFrame.describe()** returns a dataframe object.
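
Both differences can be seen in a small sketch on a hand-built copy of the selection data. The explicit conversion of **country** to the object dtype is an assumption added so that **include=['O']** matches the column even if a pandas version infers a dedicated string dtype.

```python
import pandas as pd

f500_selection = pd.DataFrame(
    {
        "rank": [1, 2, 3, 4, 5],
        "revenues": [485873, 315199, 267518, 262573, 254694],
        "profits": [13643.0, 9571.3, 1257.9, 1867.5, 16899.3],
        "country": ["USA", "China", "China", "China", "Japan"],
    },
    index=["Walmart", "State Grid", "Sinopec Group",
           "China National Petroleum", "Toyota Motor"],
)
# Assumption: force the object dtype so include=["O"] selects this column.
f500_selection["country"] = f500_selection["country"].astype(object)

series_desc = f500_selection["revenues"].describe()    # series
numeric_desc = f500_selection.describe()               # dataframe, numeric columns only
object_desc = f500_selection.describe(include=["O"])   # dataframe, object columns only

print(type(series_desc))
print(numeric_desc.columns.tolist())
print(object_desc)
```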

Let's practice using both the series and dataframe describe methods:


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Use the appropriate **describe()** method to:
  - Return a series of descriptive statistics for the **profits** column, and assign the result to **profits_desc**.
  - Return a dataframe of descriptive statistics for the **revenues** and **employees** columns, in order, and assign the result to **revenue_and_employees_desc**.
  - Return a dataframe of descriptive statistics for every column in the **f500** dataframe, by checking the documentation for the correct value for the **include** parameter, and assign the result to **all_desc**.




In [0]:
# put your code here

### 1.8 More Data Exploration Methods



One basic concept in pandas is vectorized operations. Let's look at an example of how this would work with a pandas series:

```python
>>> print(my_series)

    0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64

>>> my_series = my_series + 10

>>> print(my_series)

    0    11
    1    12
    2    13
    3    14
    4    15
    dtype: int64
```

Pandas also supports many descriptive statistics methods. Here are a few handy methods (with links to documentation) that you might use when working with data in pandas:

- [Series.max()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html) and [DataFrame.max()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html)
- [Series.min()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html) and [DataFrame.min()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html)
- [Series.mean()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html) and [DataFrame.mean()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html)
- [Series.median()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html) and [DataFrame.median()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html)
- [Series.mode()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html) and [DataFrame.mode()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html)
- [Series.sum()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html) and [DataFrame.sum()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html)


As the documentation indicates, the series methods don't require an axis parameter, whereas the dataframe methods accept one so that we can specify which axis to calculate across. While you can use integers to refer to the first and second axis, pandas dataframe methods also accept the strings **"index"** and **"columns"** for the axis parameter. Let's refresh our memory on how this works:

<img width="700" src="https://drive.google.com/uc?export=view&id=1euiSMOgXE7IVP_U-VIRzwV6JAXBpC4yx">

For instance, if we wanted to find the median (middle) value for the **revenues** and **profits** columns, we could use the following code:

```python
medians = f500[["revenues", "profits"]].median(axis=0)
# we could also use .median(axis="index")
print(medians)

revenues    40236.0
profits      1761.6
dtype: float64
  
```



In fact, the default value for the axis parameter with these methods is **axis=0**, so we could have just used the **median()** method without a parameter to get the same result!
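
To contrast the two axes, here is a sketch on the first five rows of the **revenues** and **profits** columns, rebuilt by hand. The row-wise sum is only there to demonstrate **axis="columns"**, not as a meaningful financial figure.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "revenues": [485873, 315199, 267518, 262573, 254694],
        "profits": [13643.0, 9571.3, 1257.9, 1867.5, 16899.3],
    },
    index=["Walmart", "State Grid", "Sinopec Group",
           "China National Petroleum", "Toyota Motor"],
)

# axis="index" (or axis=0) calculates down the rows: one result per column.
by_column = demo.median(axis="index")

# axis="columns" (or axis=1) calculates across the columns: one result per row.
by_row = demo.sum(axis="columns")

print(by_column)
print(by_row)
```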

Another extremely handy method for exploring data in pandas is the [Series.value_counts()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) method. The **Series.value_counts()** method displays each unique non-null value from a series, with a count of the number of times that value is used. We saw above that the **sector** column has 21 unique values. Let's use **Series.value_counts()** to look at the top 5:

```python
>>> print(f500["sector"].value_counts().head())

    Financials                118
    Energy                     80
    Technology                 44
    Motor Vehicles & Parts     34
    Wholesalers                28
    Name: sector, dtype: int64
```

Let's take a moment to walk through what happened in that line of code:

- We used the **print()** function to print the output of the following method chain:
    - Select the **sector** column from the **f500** dataframe, and on the resulting series
    - Use the **Series.value_counts()** to produce a series of the unique values and their counts in order, and on the resulting series
    - Use the [Series.head()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.head.html#pandas.Series.head) method to return the first 5 items only.
    
    
We haven't seen the **Series.head()** method before, but it works similarly to **DataFrame.head()**, returning the first five items from a series, or a different number if you provide an argument.

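The **Series.value_counts()** method is one of the handiest methods to use when exploring a data set. In older versions of pandas it had no dataframe counterpart; since pandas 1.1 there is also a **DataFrame.value_counts()** method, which counts unique rows rather than the values of a single column.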
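
A minimal sketch of **Series.value_counts()** on a five-item series. The series is built by hand from the **country** values of the first five companies, so the counts here are not those of the full data set.

```python
import pandas as pd

# Country values for the first five companies, as shown earlier in this lesson.
countries = pd.Series(
    ["USA", "China", "China", "China", "Japan"],
    index=["Walmart", "State Grid", "Sinopec Group",
           "China National Petroleum", "Toyota Motor"],
)

# The result is a series: unique values become the index, counts become the values.
counts = countries.value_counts()

# Counts are sorted from most to least common, so head(1) gives the top value.
top = counts.head(1)

print(counts)
print(top)
```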

Don't worry too much about having to remember which methods belong to which objects for now. You'll find that as you practice them some will stick, and for the rest you'll be able to reference the pandas documentation.

Let's start the process by practicing some of these to explore the Fortune 500 some more!

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Use **Series.value_counts()** and **Series.head()** to return the 5 most common values for the **country** column, and assign the results to **top5_countries.**
- Use **Series.value_counts()** and **Series.head()** to return the 5 most common values for the **previous_rank** column, and assign the results to **top5_previous_rank**.
- Use the appropriate **max()** method to find the maximum value for only the numeric columns from **f500** (you may need to check the documentation), and assign the result to the variable **max_f500**.






In [0]:
# put your code here

### 1.9 Assignment with pandas



Looking at the results of the most common values for the **previous_rank** column in the last exercise, you might have noticed something a little odd:

```python
>>> print(top5_previous_rank.head())

    0      33
    159     1
    147     1
    148     1
    149     1
    Name: previous_rank, dtype: int64
```

This indicates that 33 companies had the value **0** for their rank in the Fortune 500 for the previous year. Given that a rank of zero doesn't exist, we might conclude that these companies didn't have a rank at all for the previous year. It would make more sense for us to replace these values with a null value to more clearly indicate that the value is missing. There are a few things we need to be able to do before we can correct this. The first is how to assign values using pandas.

When we used NumPy, we learned that the same techniques that we use to select data could be used for assignment. Let's look at an example:


```python
import numpy as np

my_array = np.array([1, 2, 3, 4])

# to perform selection
print(my_array[0])

# to perform assignment
my_array[0] = 99
```

The same is true with pandas. Let's look at this example:

```python
>>> top5_rank_revenue = f500[["rank", "revenues"]].head()

>>> print(top5_rank_revenue)
                              rank  revenues
    Walmart                      1    485873
    State Grid                   2    315199
    Sinopec Group                3    267518
    China National Petroleum     4    262573
    Toyota Motor                 5    254694

>>> top5_rank_revenue["revenues"] = 0

>>> print(top5_rank_revenue)
                              rank  revenues
    Walmart                      1         0
    State Grid                   2         0
    Sinopec Group                3         0
    China National Petroleum     4         0
    Toyota Motor                 5         0
    
```


When we select a whole column by label and use assignment, we assign the value to every item in that column.

By providing labels for both axes, we can assign to a single value within our dataframe.

```python
>>> top5_rank_revenue.loc["Sinopec Group", "revenues"] = 999

>>> print(top5_rank_revenue)
                              rank  revenues
    Walmart                      1         0
    State Grid                   2         0
    Sinopec Group                3       999
    China National Petroleum     4         0
    Toyota Motor                 5         0
```

If we assign a value using an index or column label that does not exist, pandas will create a new row or column in our dataframe. Let's add a new column and a new row to our **top5_rank_revenue** dataframe:

```python
>>> top5_rank_revenue["year_founded"] = 0

>>> print(top5_rank_revenue)

                              rank  revenues  year_founded
    Walmart                      1         0             0
    State Grid                   2         0             0
    Sinopec Group                3       999             0
    China National Petroleum     4         0             0
    Toyota Motor                 5         0             0

>>> top5_rank_revenue.loc["My New Company"] = 555

>>> print(top5_rank_revenue)

                              rank  revenues  year_founded
    Walmart                      1         0             0
    State Grid                   2         0             0
    Sinopec Group                3       999             0
    China National Petroleum     4         0             0
    Toyota Motor                 5         0             0
    My New Company             555       555           555
```


There is one exception to be aware of: You **can't** create a new row/column by attempting to use the dot accessor shortcut with a label that does not exist.
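As a minimal sketch of this exception, the snippet below uses a small stand-in dataframe built by hand (it is not loaded from **f500.csv**, and the **hq_city** label is made up for illustration). Assigning through the dot accessor to a label that doesn't exist sets a plain Python attribute on the object instead of adding a column:

```python
import pandas as pd

# Stand-in for top5_rank_revenue, built by hand for illustration
demo = pd.DataFrame(
    {"rank": [1, 2], "revenues": [485873, 315199]},
    index=["Walmart", "State Grid"],
)

# Bracket assignment with a new label creates a column
demo["year_founded"] = 0

# Dot assignment with a new (hypothetical) label does NOT create a column
demo.hq_city = "unknown"

print(demo.columns.tolist())
```

Only **year_founded** shows up in the column list, so bracket or **loc[]** assignment is the safer habit when adding data.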

Let's practice assigning values and adding new columns using our full Fortune 500 dataframe:


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Add a new column, **revenues_b**, to the **f500** dataframe by using vectorized division to divide the values in the existing **revenues** column by 1000 (converting them from millions to billions).
- The company **'Dow Chemical'** has named a new CEO. Update the value where the index label is **Dow Chemical** and the column is **ceo** to **Jim Fitterling**.

In [0]:
# put your code here

### 1.10 Using Boolean Indexing with pandas Objects



Now that we know how to assign values in pandas, we're one step closer to being able to correct the values in the **previous_rank** column that are **0**. If we knew the label of every single row where this was the case, we could do this manually by using a list of labels when we performed our assignment.

While it's helpful to be able to replace specific values in rows where we know the row label ahead of time, this is cumbersome when we want to do this for all rows that meet the same criteria. Another option would be to use a loop, but this would be slower and would lose the benefits of vectorization that pandas gives us. Instead, we can use **boolean indexing**.

Just like NumPy, pandas allows us to use boolean indexing to select items based on their value, which will make our task a lot easier. Let's refresh our memory of how boolean indexing is used for selection, and learn how boolean indexing works in pandas.

In NumPy, boolean arrays are created by performing a vectorized boolean comparison on a NumPy ndarray. In pandas this works almost identically, except that the resulting boolean object will be either a series or a dataframe, depending on the object on which the boolean comparison was performed. Let's look at an example of performing a boolean comparison on a series vs a dataframe:

<img width="600" src="https://drive.google.com/uc?export=view&id=1BJNbnwxzO7TBjbYhFJppw1o93Tnn7xMU">
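The same idea can be sketched in code. The small dataframe below is made up for illustration (its labels and values are assumptions, not taken from the diagram): comparing a single column yields a boolean series, while comparing the whole dataframe yields a boolean dataframe.

```python
import pandas as pd

# Hypothetical data, used only to show the two result types
df = pd.DataFrame({"a": [1, 6, 3], "b": [7, 2, 9]}, index=["x", "y", "z"])

# Comparison on one column (a series) gives a boolean series
series_bool = df["a"] > 5

# Comparison on the whole dataframe gives a boolean dataframe
frame_bool = df > 5

print(type(series_bool).__name__)
print(type(frame_bool).__name__)
```

In both cases the axis labels of the original object are carried over to the boolean result.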


It's much less common to use a boolean dataframe than a boolean series in pandas. You almost always want to use the results of a comparison on one column from a dataframe (a series object) to select data in the main dataframe, or a selection of the main dataframe.

Let's look at two examples of how that works in diagram form. For our example, we'll be working with this dataframe of people and their favorite numbers:

<img width="600" src="https://drive.google.com/uc?export=view&id=1FqhK-Kfr7u7JDeAbfEFxn3_0nONlIpp1">

Let's check which people have a favorite number of 8. We perform a vectorized boolean operation that produces a boolean series:

<img width="600" src="https://drive.google.com/uc?export=view&id=1OgQSNM8KwzI5Mr3UJ4Y-89tXdc9t6Ybg">


We can use that series to index the whole dataframe, leaving us the rows that correspond only to people whose favorite number is 8.

<img width="600" src="https://drive.google.com/uc?export=view&id=1WeDutjUzaV2RqdHgJkSVWQNPOguqskak">

Note that we didn't use **loc[]**. This is because boolean arrays use the same shortcut as slices to select along the index axis.
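The diagrams above can be approximated in code. This is only a sketch: the names and numbers below are invented stand-ins, not the values shown in the images.

```python
import pandas as pd

# Hypothetical people and favorite numbers
people = pd.DataFrame(
    {"favorite_number": [8, 5, 8, 3]},
    index=["Ana", "Ben", "Cal", "Dee"],
)

# Vectorized comparison produces a boolean series
fav_8_bool = people["favorite_number"] == 8

# Shortcut: index the dataframe directly with the boolean series
fav_8 = people[fav_8_bool]

# Equivalent selection written explicitly with loc[]
fav_8_loc = people.loc[fav_8_bool]

print(fav_8)
```

Both forms keep only the rows where the boolean series is **True**; the **loc[]** form becomes necessary once we also want to pick specific columns.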


Now let's look at an example of using boolean indexing with our Fortune 500 dataset. We want to find out which are the 5 most common countries for companies belonging to the **'Motor Vehicles and Parts'** industry.

We start by making a boolean series that shows us which rows from our dataframe have the value of **Motor Vehicles and Parts** for the **industry** column. We'll then print the first five items of our boolean series so we can see it in action:


```python
>>> motor_bool = f500["industry"] == "Motor Vehicles and Parts"

>>> print(motor_bool.head())

    Walmart                     False
    State Grid                  False
    Sinopec Group               False
    China National Petroleum    False
    Toyota Motor                 True
    Name: industry, dtype: bool
```


Notice that like our examples in the diagrams above, the index labels are retained in our boolean series. Next, we use that boolean series to select only the rows that have **True** for our boolean index, and just the **country** column, and then print the first 5 items to check the values:

```python
>>> motor_countries = f500.loc[motor_bool, "country"]

>>> print(motor_countries.head())

    Toyota Motor        Japan
    Volkswagen        Germany
    Daimler           Germany
    General Motors        USA
    Ford Motor            USA
    Name: country, dtype: object
```


Lastly, we can use the **value_counts()** method for the **motor_countries** series, chained to the **head()** method to produce a series of the top 5 countries for the 'Motor Vehicles and Parts' industry:

```python
>>> top5_motor_countries = motor_countries.value_counts().head()

>>> print(top5_motor_countries)

    Japan          10
    China           7
    Germany         6
    France          3
    South Korea     3
    Name: country, dtype: int64
```

Let's practice using boolean indexing in pandas to identify the five highest ranked companies from South Korea. Remember, we observed earlier that the **f500** dataframe is already sorted by rank, so we won't need to perform any extra sorting.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Create a boolean series, **kr_bool**, that compares whether the values in the **country** column from the **f500** dataframe are equal to **"South Korea"**.
- Use that boolean series to index the full **f500** dataframe, assigning just the first five rows to **top_5_kr**.

In [0]:
# put your code here

### 1.11 Using Boolean Arrays to Assign Values



We now have all the knowledge we need to fix the 0 values in the **previous_rank** column. We know how to:

- perform assignment in pandas
- use boolean indexing in pandas

Let's look at an example of how we combine these two operations. For our example, we'll change the **'Motor Vehicles & Parts'** values in the **sector** column to **'Motor Vehicles and Parts'**, i.e. we will change the ampersand **(&)** to **and**.

First, we create a boolean series by comparing the values in the sector column to **'Motor Vehicles & Parts'**.

```python
ampersand_bool = f500["sector"] == "Motor Vehicles & Parts"
```

Next, we use that boolean series and the string **"sector"** to perform the assignment.

```python
f500.loc[ampersand_bool,"sector"] = "Motor Vehicles and Parts"
```

Just like we saw in the NumPy mission earlier in this course, we can remove the intermediate step of creating a boolean series, and combine everything into one line. This is the most common way to write pandas code to perform assignment using boolean arrays:

```python
f500.loc[f500["sector"] == "Motor Vehicles & Parts","sector"] = "Motor Vehicles and Parts"
```

Now we can follow this pattern to replace the values in the **previous_rank** column. We'll replace these values with **np.nan**, which is used in pandas, just as it is in NumPy, to represent values that can't be represented numerically, most commonly missing values.

To make comparing the values in this column before and after our operation easier, we've added the following line of code to the cell below:

```python
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()
```

This uses **Series.value_counts()** and **Series.head()** to display the 5 most common values in the **previous_rank** column, but adds an additional **dropna=False** parameter, which stops the **Series.value_counts()** method from excluding null values when it makes its calculation, as shown in the [Series.value_counts() documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html#pandas.Series.value_counts).
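To see the effect of **dropna=False** in isolation, here is a small sketch on a hand-made series (the values are invented for illustration and are not taken from **f500**):

```python
import numpy as np
import pandas as pd

# Hypothetical series containing three missing values
ranks = pd.Series([0.0, 0.0, np.nan, 12.0, np.nan, np.nan])

# Default behavior: null values are left out of the counts
counts_default = ranks.value_counts()

# dropna=False: null values are counted like any other value
counts_with_nan = ranks.value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```

With the default settings only the three non-null entries are counted, while the second call also reports the three NaN entries as the most common value.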


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Use boolean indexing to update values in the **previous_rank** column of the **f500** dataframe:
  - Where previously there was a value of 0, there should now be a value of **np.nan**.
  - It is up to you whether you assign the boolean series to its own variable first, or whether you complete the operation in one line.
- Create a new pandas series, **prev_rank_after**, using the same syntax that was used to create the **prev_rank_before** series.
- After you have run your code, use the variable inspector to compare **prev_rank_before** and **prev_rank_after**.

In [0]:
import numpy as np
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()

# put your code here

### 1.12 Challenge: Top Performers by Country



You may have noticed that after we assigned NaN values, the **previous_rank** column changed dtype. Let's take a closer look:

```python
>>> print(prev_rank_before)

    0      33
    159     1
    147     1
    148     1
    149     1

>>> print(prev_rank_after)

    NaN      33
    471.0     1
    234.0     1
    125.0     1
    166.0     1
```

The index of the series that **Series.value_counts()** produces now shows floats like 471.0 instead of the integers from before. The reason is that pandas uses the NumPy integer dtype, which does not support NaN values. When you try to assign a NaN value to an integer column, pandas will silently convert that column to a float dtype. If you're interested in finding out more about this, [there is a specific section on integer NaN values in the pandas documentation](http://pandas.pydata.org/pandas-docs/stable/gotchas.html#nan-integer-na-values-and-na-type-promotions).
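The dtype change can be reproduced on a tiny series. This sketch is an illustration under assumptions: the values are made up, and it uses **Series.mask()** (a different technique from the **loc[]** assignment in the exercise) so that the result does not depend on how a particular pandas version handles assigning NaN into an integer column.

```python
import pandas as pd

# Hypothetical integer ranks, where 0 stands for "no previous rank"
prev_rank = pd.Series([0, 159, 147], dtype="int64")
print(prev_rank.dtype)

# Replace the zeros with NaN; mask() fills with NaN where the condition is True
prev_rank_nan = prev_rank.mask(prev_rank == 0)
print(prev_rank_nan.dtype)
print(prev_rank_nan)
```

The original series holds integers, while the result holds floats, because NaN itself is a float value and the integer dtype has no way to store it.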

We'll finish this mission with a challenge. In this challenge, we'll calculate a specific statistic or attribute of each of the three most common countries from our f500 dataframe. We've identified the three most common countries using the code below:

```python
>>> top_3_countries = f500["country"].value_counts().head(3)

>>> print(top_3_countries)

USA      132
China    109
Japan     51
Name: country, dtype: int64
```

Don't be discouraged if this takes a few attempts to get right; working with data is an iterative process!

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Create a series, **cities_usa**, containing counts of the five most common Headquarter Location cities for companies headquartered in the USA.
2. Create a series, **sector_china**, containing counts of the three most common sectors for companies headquartered in China.
3. Create a float object, **mean_employees_japan**, containing the mean number of employees for companies headquartered in Japan.

In [0]:
# put your code here

In this lesson, we learned:

- How pandas and NumPy can be combined to make working with data easier
- About the two core pandas types: series and dataframes
- How to select data from pandas objects using axis labels
- How to select data from pandas objects using boolean arrays
- How to assign data using labels and boolean arrays
- How to create new rows and columns in pandas
- Many new methods to make data analysis easier in pandas