# MPG Cars

### Introduction:

The following exercise uses data from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG).

### Step 1. Import the necessary libraries

In [5]:
import pandas as pd
import numpy as np

### Step 2. Import the datasets [cars1](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Merge/Auto_MPG/cars1.csv) and [cars2](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Merge/Auto_MPG/cars2.csv).

Data Set Information:

This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute "mpg", 8 of the original instances were removed because they had unknown values for the "mpg" attribute. The original dataset is available in the file "auto-mpg.data-original". 

"The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes." (Quinlan, 1993)


Attribute Information:

1. mpg: continuous 
2. cylinders: multi-valued discrete 
3. displacement: continuous 
4. horsepower: continuous 
5. weight: continuous 
6. acceleration: continuous 
7. model year: multi-valued discrete 
8. origin: multi-valued discrete 
9. car name: string (unique for each instance)

### Step 3. Assign each to a variable, called cars1 and cars2

In [7]:
# Load both files straight from the exercise repository.
cars1 = pd.read_csv("https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Merge/Auto_MPG/cars1.csv")
cars2 = pd.read_csv("https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Merge/Auto_MPG/cars2.csv")

# Alternatively, read local copies of the same files:
# cars1 = pd.read_csv("/Users/markyashar/sf16_ids1/class_materials/pandas_exercises/05_Merge/Auto_MPG/cars1.csv")
# cars2 = pd.read_csv("/Users/markyashar/sf16_ids1/class_materials/pandas_exercises/05_Merge/Auto_MPG/cars2.csv")

# print is a function in Python 3, so the call needs parentheses.
print(cars1.head())
print(cars2.head())

    mpg  cylinders  displacement horsepower  weight  acceleration  model  \
0  18.0          8           307        130    3504          12.0     70   
1  15.0          8           350        165    3693          11.5     70   
2  18.0          8           318        150    3436          11.0     70   
3  16.0          8           304        150    3433          12.0     70   
4  17.0          8           302        140    3449          10.5     70   

   origin                        car  Unnamed: 9  Unnamed: 10  Unnamed: 11  \
0       1  chevrolet chevelle malibu         NaN          NaN          NaN   
1       1          buick skylark 320         NaN          NaN          NaN   
2       1         plymouth satellite         NaN          NaN          NaN   
3       1              amc rebel sst         NaN          NaN          NaN   
4       1                ford torino         NaN          NaN          NaN   

   Unnamed: 12  Unnamed: 13  
0          NaN          NaN  
1          NaN

### Step 4. Oops, it seems our first dataset has some unnamed blank columns. Fix cars1

In [8]:
cars1 = cars1.loc[:, "mpg":"car"]

# Docstring:
# Purely label-location based indexer for selection by label.

# ``.loc[]`` is primarily label based, but may also be used with a
# boolean array.

# Allowed inputs are:

# - A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
#   interpreted as a *label* of the index, and **never** as an
#   integer position along the index).
# - A list or array of labels, e.g. ``['a', 'b', 'c']``.
# - A slice object with labels, e.g. ``'a':'f'`` (note that contrary
#   to usual python slices, **both** the start and the stop are included!).
# - A boolean array.

# ``.loc`` will raise a ``KeyError`` when the items are not found.

cars1.head()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model | origin | car |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
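
The label slice above works because the wanted columns are contiguous. As a rough sketch of two alternatives, the snippet below drops the blank columns either by content or by name. The small `raw` frame is made up for illustration and only mimics the shape of the problem in cars1.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of cars1: two real columns plus two blank ones.
raw = pd.DataFrame({
    "mpg": [18.0, 15.0],
    "car": ["chevrolet chevelle malibu", "buick skylark 320"],
    "Unnamed: 9": [np.nan, np.nan],
    "Unnamed: 10": [np.nan, np.nan],
})

# Option 1: drop every column whose values are all missing.
by_content = raw.dropna(axis=1, how="all")

# Option 2: keep only columns whose name does not start with "Unnamed".
by_name = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]

print(by_content.columns.tolist())
print(by_name.columns.tolist())
```

Dropping by content would also remove a legitimately named column that happens to be entirely empty, so the name-based filter may be the safer choice when that matters.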


### Step 5. What is the number of observations in each dataset?

In [14]:
print(cars1.shape)  # cars1 has 198 observations
print(cars2.shape)  # cars2 has 200 observations

(198, 9)
(200, 9)
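
The first element of `shape` is the row count. A minimal sketch of reading it directly, using two made-up frames (`first` and `second` are illustrative names, not part of the exercise):

```python
import pandas as pd

# Hypothetical stand-ins for cars1 and cars2.
first = pd.DataFrame({"mpg": [18.0, 15.0, 18.0]})
second = pd.DataFrame({"mpg": [27.0, 44.0]})

n_first = len(first)         # number of rows
n_second = second.shape[0]   # same information, taken from shape
total = n_first + n_second

print(n_first, n_second, total)
```

For the real data, the shapes printed above give 198 + 200 = 398 rows in total, which is the `size` used when generating the owners values in Step 7.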


### Step 6. Join cars1 and cars2 into a single DataFrame called cars

In [9]:
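# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way.
cars = pd.concat([cars1, cars2])

# The docstring of the legacy append method is kept below for reference.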

# Signature: cars1.append(other, ignore_index=False, verify_integrity=False)
# Docstring:
# Append rows of `other` to the end of this frame, returning a new
# object. Columns not in this frame are added as new columns.

# Parameters
# ----------
# other : DataFrame or Series/dict-like object, or list of these
#     The data to append.
# ignore_index : boolean, default False
#     If True, do not use the index labels.
# verify_integrity : boolean, default False
#     If True, raise ValueError on creating index with duplicates.

# Returns
# -------
# appended : DataFrame

# Notes
# -----
# If a list of dict/series is passed and the keys are all contained in
# the DataFrame's index, the order of the columns in the resulting
# DataFrame will be unchanged.

# See also
# --------
# pandas.concat : General function to concatenate DataFrame, Series
#     or Panel objects

# Examples
# --------

# >>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
# >>> df
#    A  B
# 0  1  2
# 1  3  4
# >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
# >>> df.append(df2)
#    A  B
# 0  1  2
# 1  3  4
# 0  5  6
# 1  7  8

# With `ignore_index` set to True:

# >>> df.append(df2, ignore_index=True)
#    A  B
# 0  1  2
# 1  3  4
# 2  5  6
# 3  7  8

cars

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model | origin | car |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
| 5 | 15.0 | 8 | 429 | 198 | 4341 | 10.0 | 70 | 1 | ford galaxie 500 |
| 6 | 14.0 | 8 | 454 | 220 | 4354 | 9.0 | 70 | 1 | chevrolet impala |
| 7 | 14.0 | 8 | 440 | 215 | 4312 | 8.5 | 70 | 1 | plymouth fury iii |
| 8 | 14.0 | 8 | 455 | 225 | 4425 | 10.0 | 70 | 1 | pontiac catalina |
| 9 | 15.0 | 8 | 390 | 190 | 3850 | 8.5 | 70 | 1 | amc ambassador dpl |
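
When two frames are stacked, each keeps its own index labels by default, so the combined frame can contain repeated labels. The sketch below shows the difference `ignore_index=True` makes with `pd.concat`. The frames `left` and `right` are small made-up examples, not the real datasets.

```python
import pandas as pd

# Hypothetical stand-ins for cars1 and cars2.
left = pd.DataFrame({"mpg": [18.0, 15.0]})
right = pd.DataFrame({"mpg": [27.0, 44.0, 32.0]})

# Default: original labels are kept, so 0 and 1 each appear twice.
kept = pd.concat([left, right])

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([left, right], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Whether to renumber is a design choice: the original labels record where a row sat in its source file, while a fresh index makes label-based lookups unambiguous.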


### Step 7. Oops, there is a column missing, called owners. Create a random number Series from 15,000 to 73,000.

In [10]:
nr_owners = np.random.randint(15000, high=73001, size=398, dtype='l')

# Docstring:
# ========================
# Random Number Generation
# ========================

# ==================== =========================================================
# Utility functions
# ==============================================================================
# random_sample        Uniformly distributed floats over ``[0, 1)``.
# random               Alias for `random_sample`.
# bytes                Uniformly distributed random bytes.
# random_integers      Uniformly distributed integers in a given range.
# permutation          Randomly permute a sequence / generate a random sequence.
# shuffle              Randomly permute a sequence in place.
# seed                 Seed the random number generator.
# choice               Random sample from 1-D array.

# ==================== =========================================================

# ==================== =========================================================
# Compatibility functions
# ==============================================================================
# rand                 Uniformly distributed values.
# randn                Normally distributed values.
# ranf                 Uniformly distributed floating point numbers.
# randint              Uniformly distributed integers in a given range.
# ==================== =========================================================

# ==================== =========================================================
# Univariate distributions
# ==============================================================================
# beta                 Beta distribution over ``[0, 1]``.
# binomial             Binomial distribution.
# chisquare            :math:`\chi^2` distribution.
# exponential          Exponential distribution.
# f                    F (Fisher-Snedecor) distribution.
# gamma                Gamma distribution.
# geometric            Geometric distribution.
# gumbel               Gumbel distribution.
# hypergeometric       Hypergeometric distribution.
# laplace              Laplace distribution.
# logistic             Logistic distribution.
# lognormal            Log-normal distribution.
# logseries            Logarithmic series distribution.
# negative_binomial    Negative binomial distribution.
# noncentral_chisquare Non-central chi-square distribution.
# noncentral_f         Non-central F distribution.
# normal               Normal / Gaussian distribution.
# pareto               Pareto distribution.
# poisson              Poisson distribution.
# power                Power distribution.
# rayleigh             Rayleigh distribution.
# triangular           Triangular distribution.
# uniform              Uniform distribution.
# vonmises             Von Mises circular distribution.
# wald                 Wald (inverse Gaussian) distribution.
# weibull              Weibull distribution.
# zipf                 Zipf's distribution over ranked data.
# ==================== =========================================================

# ==================== =========================================================
# Multivariate distributions
# ==============================================================================
# dirichlet            Multivariate generalization of Beta distribution.
# multinomial          Multivariate generalization of the binomial distribution.
# multivariate_normal  Multivariate generalization of the normal distribution.
# ==================== =========================================================

# ==================== =========================================================
# Standard distributions
# ==============================================================================
# standard_cauchy      Standard Cauchy-Lorentz distribution.
# standard_exponential Standard exponential distribution.
# standard_gamma       Standard Gamma distribution.
# standard_normal      Standard normal distribution.
# standard_t           Standard Student's t-distribution.
# ==================== =========================================================

nr_owners

array([52519, 24584, 16530, 71477, 63869, 40076, 43902, 61788, 49539,
       68469, 41854, 62681, 68047, 55636, 39420, 71437, 60756, 47757,
       24382, 30083, 28502, 50786, 26548, 70664, 51571, 46447, 47905,
       50125, 29329, 23333, 61014, 70669, 55412, 55172, 23110, 38382,
       32241, 43471, 38688, 24645, 20562, 19357, 21691, 59666, 40500,
       44501, 38071, 63052, 46498, 47223, 57992, 55413, 61466, 28950,
       21958, 62135, 25424, 33947, 27188, 72374, 53243, 70845, 48289,
       17739, 16068, 28778, 50290, 16555, 64734, 64739, 53028, 30332,
       16633, 64492, 72081, 70407, 54940, 61701, 20903, 44624, 59547,
       52502, 60058, 39724, 19347, 24926, 37509, 49409, 25465, 51965,
       56600, 38684, 35826, 56605, 18063, 57333, 38617, 19127, 30152,
       63800, 56530, 68687, 69054, 25290, 67862, 18040, 57376, 50602,
       53028, 34684, 41591, 52530, 67482, 29280, 70804, 26338, 42360,
       35175, 49865, 47764, 35126, 21139, 72583, 16574, 62625, 18663,
       55616, 42233,
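
The step asks for a Series, while `np.random.randint` returns a plain array. As one possible variant, the sketch below uses NumPy's newer `Generator` interface and wraps the result in a Series. The seed value and the names `rng` and `owners` are arbitrary choices made here for illustration.

```python
import numpy as np
import pandas as pd

# A fixed seed (42 is arbitrary) makes the draw repeatable between runs.
rng = np.random.default_rng(seed=42)

# endpoint=True makes the upper bound inclusive, so 73000 itself can occur.
values = rng.integers(15000, 73000, size=398, endpoint=True)

# Wrap the array in a named Series.
owners = pd.Series(values, name="owners")

print(owners.head())
```

With `endpoint=True` there is no need to write the upper bound as 73001, as the `randint` call has to.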

### Step 8. Add the column owners to cars

In [11]:
cars['owners'] = nr_owners
cars.tail()

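|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model | origin | car | owners |
|---|---|---|---|---|---|---|---|---|---|---|
| 195 | 27.0 | 4 | 140 | 86 | 2790 | 15.6 | 82 | 1 | ford mustang gl | 37293 |
| 196 | 44.0 | 4 | 97 | 52 | 2130 | 24.6 | 82 | 2 | vw pickup | 44813 |
| 197 | 32.0 | 4 | 135 | 84 | 2295 | 11.6 | 82 | 1 | dodge rampage | 22733 |
| 198 | 28.0 | 4 | 120 | 79 | 2625 | 18.6 | 82 | 1 | ford ranger | 28674 |
| 199 | 31.0 | 4 | 119 | 82 | 2720 | 19.4 | 82 | 1 | chevy s-10 | 52209 |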
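
The tail above ends at label 199 even though the frame has 398 rows, which means the index labels repeat. That matters for column assignment: a NumPy array is assigned by position, whereas a Series is aligned by index label. The following sketch illustrates the difference on a small made-up frame (`cars_demo` and its column names are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the combined frame, with labels 0, 1, 0, 1, 2.
cars_demo = pd.concat([
    pd.DataFrame({"mpg": [18.0, 15.0]}),
    pd.DataFrame({"mpg": [27.0, 44.0, 32.0]}),
])

owners = np.array([10, 20, 30, 40, 50])

# Array: values are placed row by row, in order.
cars_demo["owners_array"] = owners

# Series with a default 0..4 index: values are matched by label,
# so both rows labelled 0 receive the same value.
cars_demo["owners_series"] = pd.Series(owners)

print(cars_demo)
```

Passing the array, as the cell above does, avoids the alignment step. If the values already live in a Series, calling `.to_numpy()` on it before assigning gives the same positional behavior.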
