# Moving data from numpy arrays to pandas DataFrames
In our last notebook, we trained a model and compared our actual and predicted results.

What may not have been evident is that we were working with two different kinds of objects: a **numpy array** and a **pandas DataFrame**.

To explore further, let's rerun the code from the previous notebook to create a trained model and get predicted values for our test data.

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [4]:
# Load our data from the csv file
delays_df = pd.read_csv('Data/Lots_of_flight_data.csv') 

# Remove rows with null values since those will crash our linear regression model training
delays_df.dropna(inplace=True)

# Move our features into the X DataFrame
X = delays_df.loc[:,['DISTANCE','CRS_ELAPSED_TIME']]

# Move our labels into the y DataFrame
y = delays_df.loc[:,['ARR_DELAY']] 

# Split our data into test and training DataFrames
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.3, 
    random_state=42
)
regressor = LinearRegression()     # Create a scikit learn LinearRegression object
regressor.fit(X_train, y_train)    # Use the fit method to train the model using your training data

y_pred = regressor.predict(X_test)  # Generate predicted values for our test data

In the last notebook, you might have noticed that the predicted values in y_pred and the actual values in y_test look different when you display them.

In [5]:
y_pred

array([[3.47739078],
       [5.89055919],
       [4.33288464],
       ...,
       [5.84678979],
       [6.05195889],
       [5.66255414]])

In [6]:
y_test

        ARR_DELAY
291483       -5.0
98997       -12.0
23454        -9.0
110802      -14.0
49449       -20.0
94944        14.0
160885      -17.0
47572       -20.0
164800       20.0
62578        -9.0


Use **type()** to check the datatype of an object.

In [7]:
type(y_pred)

numpy.ndarray

In [8]:
type(y_test)

pandas.core.frame.DataFrame

* **y_pred** is a numpy array
* **y_test** is a pandas DataFrame

Another way you might discover this is if you try to use the **head** method on **y_pred**. 

This raises an error, because **head** is a method of the DataFrame class, not of numpy arrays.

In [9]:
y_pred.head()

AttributeError: 'numpy.ndarray' object has no attribute 'head'
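If all you need is a quick look at the first few values, slicing the array gives a similar result to **head** without converting anything. The sketch below uses a small made-up array, *sample_pred*, as a stand-in for *y_pred*; its values are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for y_pred: a 2D array with a single column,
# the same layout shown in the y_pred output above
sample_pred = np.array([[3.5], [5.9], [4.3], [5.8], [6.1], [5.7]])

# Slicing the first five rows is the array counterpart of DataFrame.head()
first_five = sample_pred[:5]
print(first_five)
print(first_five.shape)   # number of rows and columns in the slice
```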

A one-dimensional numpy array is similar to a pandas Series: both hold a single sequence of values.


In [10]:
import numpy as np
airports_array = np.array(['Pearson','Changi','Narita'])
print(airports_array)
print(airports_array[2])

['Pearson' 'Changi' 'Narita']
Narita


In [11]:
airports_series = pd.Series(['Pearson','Changi','Narita'])
print(airports_series)
print(airports_series[2])

0    Pearson
1     Changi
2     Narita
dtype: object
Narita
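One difference is that a Series carries an index of labels alongside its values, while an array is addressed by position only. The sketch below, which reuses the airport codes as labels purely for illustration, wraps an array in a Series and then converts it back with **to_numpy**.

```python
import numpy as np
import pandas as pd

airports_array = np.array(['Pearson', 'Changi', 'Narita'])

# Wrap the array in a Series and supply our own labels (illustrative choice)
airports_series = pd.Series(airports_array, index=['YYZ', 'SIN', 'NRT'])

print(airports_series['NRT'])     # look up a value by its label
print(airports_series.iloc[2])    # look up by position, like airports_array[2]

# Convert back to a numpy array; the labels are dropped
back_to_array = airports_series.to_numpy()
print(type(back_to_array))
```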


A two-dimensional numpy array is similar to a pandas DataFrame: both arrange values in rows and columns.

In [12]:
airports_array = np.array(
    [
        ['YYZ','Pearson'],
        ['SIN','Changi'],
        ['NRT','Narita']
    ]
)
print(airports_array)
print(airports_array[0,0])

[['YYZ' 'Pearson']
 ['SIN' 'Changi']
 ['NRT' 'Narita']]
YYZ


In [13]:
airports_df = pd.DataFrame([['YYZ','Pearson'],['SIN','Changi'],['NRT','Narita']])
print(airports_df)
print(airports_df.iloc[0,0])

     0        1
0  YYZ  Pearson
1  SIN   Changi
2  NRT   Narita
YYZ
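A DataFrame adds row and column labels on top of the same grid of values. As a rough sketch, the code below builds a DataFrame from the array with column names (*Code* and *Name* are made up for this example) and then pulls the raw values back out with **to_numpy**.

```python
import numpy as np
import pandas as pd

airports_array = np.array([['YYZ', 'Pearson'], ['SIN', 'Changi'], ['NRT', 'Narita']])

# Build a DataFrame from the array; the column names are assumptions for this example
airports_df = pd.DataFrame(airports_array, columns=['Code', 'Name'])

print(airports_array.shape)         # rows and columns of the array
print(airports_df.shape)            # the DataFrame reports the same shape
print(airports_df.loc[0, 'Code'])   # label-based lookup: row 0, column 'Code'

# to_numpy() returns the values as a numpy array, without the labels
values = airports_df.to_numpy()
print(values[0, 0])
```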


If you need the functionality of a DataFrame, you can move data from numpy objects to pandas objects and vice versa.

In the example below, we pass the numpy array *y_pred* to the DataFrame constructor to create a DataFrame called *predicted_df*.

Then we can use DataFrame functionality, such as the **head** method, on the result.

In [14]:
predicted_df = pd.DataFrame(y_pred)
predicted_df.head()

          0
0  3.477391
1  5.890559
2  4.332885
3  3.447476
4  5.072394
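Notice that the new DataFrame gets a default column name of 0 and a fresh index starting at 0, so its rows no longer share labels with *y_test*. One possible way to handle this, sketched below with small made-up stand-ins (*sample_test*, *sample_pred*, and the column name *PREDICTED_DELAY* are all hypothetical), is to pass **columns** and **index** to the DataFrame constructor.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for y_test and y_pred (predicted values are illustrative)
sample_test = pd.DataFrame({'ARR_DELAY': [-5.0, -12.0, -9.0]},
                           index=[291483, 98997, 23454])
sample_pred = np.array([[3.5], [5.9], [4.3]])

# Name the column and reuse the test index so the rows line up with the actual values
predicted_df = pd.DataFrame(sample_pred,
                            columns=['PREDICTED_DELAY'],
                            index=sample_test.index)

# Place actual and predicted values side by side, matched on the shared index
comparison_df = sample_test.join(predicted_df)
print(comparison_df)
```

This relies on the predictions being in the same row order as the test data, which holds here because the array was produced from the test rows in order.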
