We learned some of the ways pandas makes working with data easier than NumPy:

- Axis values in dataframes can have string labels, not just numeric ones, which makes selecting data much easier.
- Dataframes can contain columns with multiple data types, including `integer`, `float`, and `string`.

In [1]:
import pandas as pd

f500 = pd.read_csv("f500.csv", index_col = 0)
f500.index.name = None
f500_head = f500.head(10)
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
rank                        500 non-null int64
revenues                    500 non-null int64
revenue_change              498 non-null float64
profits                     499 non-null float64
assets                      500 non-null int64
profit_change               436 non-null float64
ceo                         500 non-null object
industry                    500 non-null object
sector                      500 non-null object
previous_rank               500 non-null int64
country                     500 non-null object
hq_location                 500 non-null object
website                     500 non-null object
years_on_global_500_list    500 non-null int64
employees                   500 non-null int64
total_stockholder_equity    500 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 66.4+ KB


Because pandas is designed to operate like NumPy, many NumPy concepts and methods are supported. Recall that one of the ways NumPy makes working with data easier is with vectorized operations, or operations applied to multiple data points at once.

Vectorization not only improves our code's performance, but also enables us to write code more quickly.

Because pandas is an extension of NumPy, it also supports vectorized operations.

In [2]:
rank_change = f500["previous_rank"] - f500["rank"]
rank_change.head()

Walmart                     0
State Grid                  0
Sinopec Group               1
China National Petroleum   -1
Toyota Motor                3
dtype: int64
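
One detail worth illustrating is that arithmetic between two series is matched by index label, not by position. The sketch below uses small made-up series (the labels and values are hypothetical, not taken from `f500`) to show this.

```python
import pandas as pd

# Hypothetical toy data: same labels in both series, but in a different order.
previous_rank = pd.Series({"A": 3, "B": 1, "C": 2})
rank = pd.Series({"C": 1, "A": 2, "B": 3})

# The subtraction is applied to every element at once (vectorized),
# and values are paired up by index label rather than by position.
rank_change_demo = previous_rank - rank

print(rank_change_demo)
```

Here the value for `"A"` is computed as 3 - 2, even though `"A"` sits at different positions in the two series.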

Like NumPy, pandas supports many descriptive statistics methods that can help us answer questions about a series, such as the largest increase and decrease in rank. Here are a few of the most useful ones:

* `Series.max()`
* `Series.min()`
* `Series.mean()`
* `Series.median()`
* `Series.mode()`
* `Series.sum()`
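
As a rough sketch of how these methods behave, the example below applies each of them to a small series holding the five `rank_change` values displayed earlier. The variable names `changes`, `summary`, and `modes` are illustrative only.

```python
import pandas as pd

# The five values shown in the rank_change.head() output.
changes = pd.Series([0, 0, 1, -1, 3])

summary = {
    "max": changes.max(),        # largest value
    "min": changes.min(),        # smallest value
    "mean": changes.mean(),      # arithmetic average
    "median": changes.median(),  # middle value after sorting
    "sum": changes.sum(),        # total of all values
}

# mode() returns a series rather than a scalar,
# because several values can tie for most frequent.
modes = changes.mode()

print(summary)
print(modes)
```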

In [3]:
rank_change_max = rank_change.max()
rank_change_min = rank_change.min()

However, according to the data dictionary, this list should only rank companies on a scale of 1 to 500. Even if the company ranked 1st in the previous year moved to 500th this year, the rank change calculated would be -499. This indicates that there is incorrect data in either the `rank` column or `previous_rank` column.

We'll learn another method that can help us investigate this issue more quickly: the `Series.describe()` method. This method tells us how many non-null values are contained in the series, along with the mean, minimum, maximum, and other statistics.

If we use `describe()` on a column that contains **non-numeric** values, we get some different statistics.

The first statistic, `count`, is the same as for numeric columns, showing us the number of non-null values. The other three statistics are new:

- `unique`: Number of unique values in the series. 
- `top`: Most common value in the series. 
- `freq`: Frequency of the most common value.
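
To make these four statistics concrete, here is a minimal sketch using a made-up series of strings with one missing value. The data and the name `sectors` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical non-numeric data with one missing value.
sectors = pd.Series(["Financials", "Energy", "Financials", None, "Technology"])

desc = sectors.describe()

print(desc["count"])   # non-null values only, so the None is not counted
print(desc["unique"])  # number of distinct non-null values
print(desc["top"])     # the most common value
print(desc["freq"])    # how often the most common value appears
```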

In [4]:
rank = f500["rank"]
rank_desc = rank.describe()
rank_desc

count    500.000000
mean     250.500000
std      144.481833
min        1.000000
25%      125.750000
50%      250.500000
75%      375.250000
max      500.000000
Name: rank, dtype: float64

In [5]:
prev_rank = f500["previous_rank"]
prev_rank_desc = prev_rank.describe()
prev_rank_desc

count    500.000000
mean     222.134000
std      146.941961
min        0.000000
25%       92.750000
50%      219.500000
75%      347.250000
max      500.000000
Name: previous_rank, dtype: float64

In the results, we might have noticed something odd: the minimum value for the `previous_rank` column is 0.

However, this column should only have values between 1 and 500 (inclusive), so a value of 0 doesn't make sense. To investigate the possible cause of this issue, let's confirm the number of 0 values that appear in the `previous_rank` column.

In [6]:
# Method chaining: a way to combine multiple method calls in a single line
zero_previous_rank = f500["previous_rank"].value_counts().loc[0]
zero_previous_rank

33

We confirmed that 33 companies in the dataframe have a value of 0 in the `previous_rank` column. Given that multiple companies have a 0 rank, we might conclude that these companies didn't have a rank at all for the previous year. It would make more sense for us to replace these values with a null value instead.
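
The chained `value_counts().loc[0]` lookup can be compared with a boolean-mask count, which is one possible alternative rather than the approach used in this notebook. The sketch below uses a made-up series; note that the `.loc[0]` lookup raises a `KeyError` when the value 0 does not occur at all, while summing a boolean mask simply returns 0.

```python
import pandas as pd

# Hypothetical ranks, with 0 standing in for "no rank".
previous_rank = pd.Series([0, 5, 0, 12, 7, 0])

# Approach from the cell above: count every value, then look up the count for 0.
zeros_chained = previous_rank.value_counts().loc[0]

# Alternative: build a boolean mask and sum it (True counts as 1).
zeros_masked = (previous_rank == 0).sum()

print(zeros_chained, zeros_masked)
```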

Before we correct these values, let's explore the rest of our dataframe to make sure there are no other data issues. Just like we used descriptive stats methods to explore individual series, we can also use descriptive stats methods to explore our f500 dataframe.

Because series and dataframes are two distinct objects, they have their own unique methods. However, there are many times where both series and dataframe objects have a method of the same name that behaves in similar ways. Below are some examples:

- `Series.max()` and `DataFrame.max()`
- `Series.min()` and `DataFrame.min()`
- `Series.mean()` and `DataFrame.mean()`
- `Series.median()` and `DataFrame.median()`
- `Series.mode()` and `DataFrame.mode()`
- `Series.sum()` and `DataFrame.sum()`

Unlike their series counterparts, dataframe methods use an `axis` parameter to determine which axis to calculate across. While we can use integers to refer to the first and second axis, pandas dataframe methods also accept the strings `"index"` and `"columns"` for the axis parameter.

The default value for the axis parameter with these methods is `axis=0`.
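
A small sketch may help show what the two axis values do. The dataframe below is made up for illustration (its labels and numbers are not from `f500`).

```python
import pandas as pd

# Hypothetical toy dataframe.
df = pd.DataFrame(
    {"revenues": [10, 20, 30], "profits": [1, 2, 3]},
    index=["A", "B", "C"],
)

# axis="index" (same as axis=0, the default) calculates along the rows,
# giving one result per column.
per_column = df.sum(axis="index")

# axis="columns" (same as axis=1) calculates along the columns,
# giving one result per row.
per_row = df.sum(axis="columns")

print(per_column)
print(per_row)
```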

In [7]:
max_f500 = f500.max(numeric_only=True) # calculate the max value of only the numeric columns

Like series objects, dataframe objects also have a `DataFrame.describe()` method that we can use to explore the dataframe more quickly.

By default, `DataFrame.describe()` returns statistics for only the numeric columns. To get statistics for just the object columns, we need to use the `include=['O']` parameter:

In [8]:
f500_desc = f500.describe() # descriptive statistics for all of the numeric columns
f500_desc

| | rank | revenues | revenue_change | profits | assets | profit_change | previous_rank | years_on_global_500_list | employees | total_stockholder_equity |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 498.0 | 499.0 | 500.0 | 436.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 250.5 | 55416.358 | 4.538353 | 3055.203206 | 243632.3 | 24.152752 | 222.134 | 15.036 | 133998.3 | 30628.076 |
| std | 144.481833 | 45725.478963 | 28.549067 | 5171.981071 | 485193.7 | 437.509566 | 146.941961 | 7.932752 | 170087.8 | 43642.576833 |
| min | 1.0 | 21609.0 | -67.3 | -13038.0 | 3717.0 | -793.7 | 0.0 | 1.0 | 328.0 | -59909.0 |
| 25% | 125.75 | 29003.0 | -5.9 | 556.95 | 36588.5 | -22.775 | 92.75 | 7.0 | 42932.5 | 7553.75 |
| 50% | 250.5 | 40236.0 | 0.55 | 1761.6 | 73261.5 | -0.35 | 219.5 | 17.0 | 92910.5 | 15809.5 |
| 75% | 375.25 | 63926.75 | 6.975 | 3954.0 | 180564.0 | 17.7 | 347.25 | 23.0 | 168917.2 | 37828.5 |
| max | 500.0 | 485873.0 | 442.3 | 45687.0 | 3473238.0 | 8909.5 | 500.0 | 23.0 | 2300000.0 | 301893.0 |


In [9]:
desc_object = f500.describe(include= ["O"]) # Statistics for the non-numeric columns
desc_object

| | ceo | industry | sector | country | hq_location | website |
|---|---|---|---|---|---|---|
| count | 500 | 500 | 500 | 500 | 500 | 500 |
| unique | 500 | 58 | 21 | 34 | 235 | 500 |
| top | Carlos Brito | Banks: Commercial and Savings | Financials | USA | Beijing, China | http://www.citigroup.com |
| freq | 1 | 51 | 118 | 132 | 56 | 1 |


# Assigning values

In [10]:
f500.loc["Dow Chemical", "ceo"]

'Andrew N. Liveris'

The company "Dow Chemical" has named a new CEO, Jim Fitterling.

In [11]:
f500.loc["Dow Chemical", "ceo"] = "Jim Fitterling"
f500.loc["Dow Chemical", "ceo"] 

'Jim Fitterling'

In [12]:
# Countries of the companies in the "Motor Vehicles and Parts" industry

motor_bool = f500["industry"] == "Motor Vehicles and Parts"
motor_countries = f500.loc[motor_bool,"country"]
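
The same pattern, a boolean mask for the rows plus a column label, can be tried on a tiny made-up dataframe. The company names and values below are hypothetical and only serve to show which rows survive the filter.

```python
import pandas as pd

# Hypothetical toy dataframe, indexed by company name.
companies = pd.DataFrame(
    {
        "industry": [
            "Motor Vehicles and Parts",
            "Banks: Commercial and Savings",
            "Motor Vehicles and Parts",
        ],
        "country": ["Japan", "USA", "Germany"],
    },
    index=["Company A", "Company B", "Company C"],
)

# Step 1: a boolean series, True where the industry matches.
is_motor = companies["industry"] == "Motor Vehicles and Parts"

# Step 2: keep the rows where the mask is True, and only the "country" column.
motor_countries_demo = companies.loc[is_motor, "country"]

print(motor_countries_demo)
```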

In [13]:
previous_rank_before = f500["previous_rank"].value_counts().head()
previous_rank_before

0      33
159     1
147     1
148     1
149     1
Name: previous_rank, dtype: int64

In [14]:
import numpy as np
# Note: a boolean mask on its own selects rows even without .loc (f500[mask]).
# Here we also need to target a single column, so we use .loc with a row mask and a column label.
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan
# As in NumPy, np.nan is used in pandas to represent values that can't be represented numerically, most commonly missing values.

In [18]:
prev_rank_after = f500["previous_rank"].value_counts(dropna = False).head()
prev_rank_after

NaN       33
 471.0     1
 234.0     1
 125.0     1
 166.0     1
Name: previous_rank, dtype: int64

Notice that after we assigned `NaN` values, the `previous_rank` column changed dtype.

The index of the series that `Series.value_counts()` produces now shows us floats like 471.0 instead of integers. The reason behind this is that pandas uses the NumPy integer dtype, which does not support `NaN` values.
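
The dtype change can be reproduced on a small made-up series. This sketch uses `Series.mask()` instead of the `.loc` assignment from the cell above, purely to keep the example short. It also shows pandas' nullable `"Int64"` extension dtype as one possible way to keep integer values alongside missing ones, which is not something this notebook does.

```python
import pandas as pd

# Hypothetical ranks stored as NumPy integers.
ranks = pd.Series([0, 471, 234])

# mask() returns a new series with NaN wherever the condition is True.
# The result is upcast to float64 because NumPy integers cannot hold NaN.
ranks_float = ranks.mask(ranks == 0)

# Alternative (an assumption, not used in this notebook): the nullable
# "Int64" dtype stores missing values as <NA> and keeps the rest as integers.
ranks_nullable = pd.Series([None, 471, 234], dtype="Int64")

print(ranks_float.dtype)
print(ranks_nullable.dtype)
```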

In [19]:
#  create a rank_change column in our f500 dataframe

f500["rank_change"] = f500["previous_rank"] - f500["rank"]

In [20]:
# descriptive statistics for the rank_change column

rank_change_desc = f500["rank_change"].describe()
rank_change_desc

count    467.000000
mean      -3.533191
std       44.293603
min     -199.000000
25%      -21.000000
50%       -2.000000
75%       10.000000
max      226.000000
Name: rank_change, dtype: float64

# Challenge

Calculate a specific statistic or attribute of each of the three most common countries from our f500 dataframe.

In [29]:
# Create a series, industry_usa, containing counts of the two most common industries for companies headquartered in the USA.

industry_usa = f500.loc[f500["country"] == "USA","industry"].value_counts().head(2)
industry_usa

Banks: Commercial and Savings               8
Insurance: Property and Casualty (Stock)    7
Name: industry, dtype: int64

In [30]:
# Create a series, sector_china, containing counts of the three most common sectors for companies headquartered in China.
sector_china =  f500.loc[f500["country"] == "China","sector"].value_counts().head(3)
sector_china

Financials     25
Energy         22
Wholesalers     9
Name: sector, dtype: int64

In [33]:
# Create a float object, mean_employees_japan, containing the mean (average) number of employees for companies headquartered in Japan
mean_employees_japan = (f500.loc[f500["country"] == "Japan","employees"]).mean()
mean_employees_japan

104564.45098039215