In [1]:
import pandas as pd

In [2]:
ufo = pd.read_csv('http://bit.ly/uforeports')

In [3]:
ufo.columns

Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')

#### Q1. How to read only 2 columns from a csv file and ignore others?

In [4]:
## Selecting using column names
ufo = pd.read_csv('http://bit.ly/uforeports', usecols = ['City', 'State'])
ufo.columns

Index(['City', 'State'], dtype='object')

In [5]:
## Selecting using column position
ufo = pd.read_csv('http://bit.ly/uforeports', usecols = [0, 3])
ufo.columns

Index(['City', 'State'], dtype='object')

#### Q2. Faster ways to read large files? How to look at only the first 2 or 3 rows of a big dataset?

In [6]:
ufo = pd.read_csv('http://bit.ly/uforeports', nrows = 3)
ufo

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
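A minimal sketch of the same idea on a self-contained example. The in-memory CSV (`csv_text`) is made up for illustration and stands in for a large file. `nrows` limits parsing to the first few rows for a quick preview. `chunksize` is not used in the cells above; it is shown here as one possible way to walk through a big file piece by piece instead of loading it all at once.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "City,State,Time\n" + "\n".join(
    f"City{i},ST,1/1/2000 0{i % 10}:00" for i in range(10)
)

# nrows: parse only the first 3 data rows, enough for a quick look.
preview = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(preview.shape)

# chunksize: read_csv returns an iterator that yields DataFrames
# of at most 4 rows each, so the whole file is never in memory at once.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_rows += len(chunk)
print(total_rows)
```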


#### Q3. How do DataFrames and Series work with regard to selecting individual entries and iteration (for x in userdata:)?

In [7]:
# Iterating through a series
for c in ufo['City']:
    print(c)

Ithaca
Willingboro
Holyoke


In [11]:
# Iterating through a dataframe
for index,row in ufo.iterrows():
    print(index, row['City'], row['State'])

0 Ithaca NY
1 Willingboro NJ
2 Holyoke CO


##### Refer to the blog post: [Optimum approach for iterating over a DataFrame](https://medium.com/@rtjeannier/pandas-101-cont-9d061cb73bfc)

 
1.  __iterrows()__ is a generator that iterates over the rows of the DataFrame and yields the index of each row together with a Series containing the row itself. iterrows() is optimized to work with Pandas DataFrames and, although it’s the least efficient way to run most standard functions, it’s a significant improvement over crude looping. In the blog's benchmark, iterrows() solves the same problem almost four times faster than manually looping over rows.

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
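The warning above can be checked with a small sketch. The `orders` frame below is a made-up stand-in for the superstore data, with mixed column types so that each row handed out by iterrows() is a copy. Writing to that row leaves the original DataFrame unchanged.

```python
import pandas as pd

# Hypothetical miniature of the superstore data (mixed dtypes).
orders = pd.DataFrame({
    "Order ID": ["A-1", "B-2"],
    "Quantity": [2, 9],
    "Shipping Cost": [40.0, 923.5],
})

for i, row in orders.iterrows():
    if row["Quantity"] > 5:
        # This only changes the temporary row Series, not `orders`.
        row["Shipping Cost"] -= 5

# The shipping costs in the original frame are untouched.
print(orders["Shipping Cost"].tolist())
```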

In [21]:
## Loading global superstore dataset used in tableau - 8275kb
superstore = pd.read_excel("C:\\Users\\Jeswin\\Documents\\Github\\Data-Analytics-Notes\\Global Superstore Orders 2016.xlsx", sheet_name = 'Orders')
superstore = superstore.loc[:,['Row ID', 'Order ID', 'Order Date', 'Ship Date','Quantity','Shipping Cost']]
superstore.head(2)

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Quantity,Shipping Cost
0,40098,CA-2014-AB10015140-41954,2014-11-11,2014-11-13,2,40.77
1,26341,IN-2014-JR162107-41675,2014-02-05,2014-02-07,9,923.63


In [26]:
def test_iterrow(df):
    for i, row in df.iterrows():
        if row['Quantity'] > 5:
            # row is a copy here, so this write does not change df itself
            row['Shipping Cost'] -= 5
        
%timeit test_iterrow(df)

29.3 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


2. __df.loc[]/df.iloc[]__ - A much better and almost as simple approach for iterating over a DataFrame is to loop over its index and read or write individual cells with the loc[] and iloc[] indexers.

In [28]:
def test_loc(df):
    # df.index yields one label per row, so there is nothing to unpack
    for i in df.index:
        if df.loc[i, 'Quantity'] > 5:
            # loc writes straight into df
            df.loc[i, 'Shipping Cost'] -= 5
        
%timeit test_loc(df)

15.7 µs ± 656 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


3. __Using apply()__ - Using apply() is not always trivial, but when mastered it is incredibly useful. The apply() method can be called on a single Pandas Series or on an entire DataFrame, so it can replicate the logic of the methods above without an explicit loop.
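A minimal sketch of how the same discount logic might look with apply(). The `orders` frame and the `discounted_cost` helper are made up for illustration. The last step uses `Series.where`, a vectorized alternative that the blog excerpt above does not mention, shown only for comparison.

```python
import pandas as pd

# Hypothetical miniature of the superstore columns used above.
orders = pd.DataFrame({
    "Quantity": [2, 9, 7],
    "Shipping Cost": [40.0, 923.5, 12.0],
})

def discounted_cost(row):
    """Return the shipping cost, reduced by 5 when Quantity exceeds 5."""
    if row["Quantity"] > 5:
        return row["Shipping Cost"] - 5
    return row["Shipping Cost"]

# axis=1 passes each row to the function as a Series.
via_apply = orders.apply(discounted_cost, axis=1)

# Vectorized alternative: keep the cost where Quantity <= 5,
# otherwise use the discounted value.
via_where = orders["Shipping Cost"].where(
    orders["Quantity"] <= 5, orders["Shipping Cost"] - 5
)

print(via_apply.tolist())
print(via_where.tolist())
```

Both approaches return a new Series, so the result has to be assigned back (for example to `orders["Shipping Cost"]`) if the DataFrame itself should change.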

#### Q4. Best way to drop every non-numeric column from a dataframe

In [12]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')

In [13]:
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In [14]:
drinks.dtypes

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [30]:
import numpy as np
drinks.select_dtypes(include = [np.number]).dtypes

beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
dtype: object
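The cell above only inspects the dtypes; to actually drop the non-numeric columns, the result of select_dtypes() can be assigned to a variable. A small sketch, using a made-up two-row `sample` frame in place of the drinks data; the string `'number'` is assumed here as shorthand for `np.number`, and `exclude` gives the complementary selection.

```python
import pandas as pd

# Hypothetical two-row sample with the same kinds of columns as drinks.
sample = pd.DataFrame({
    "country": ["Albania", "Algeria"],
    "beer_servings": [89, 25],
    "total_litres_of_pure_alcohol": [4.9, 0.7],
    "continent": ["Europe", "Africa"],
})

# Keep only the numeric columns (integers and floats).
numeric_only = sample.select_dtypes(include="number")

# The complement: everything that is not numeric.
non_numeric = sample.select_dtypes(exclude="number")

print(list(numeric_only.columns))
print(list(non_numeric.columns))
```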

In [31]:
## Will give details of all columns 
drinks.describe(include = 'all')

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
count,193,193.0,193.0,193.0,193.0,193
unique,193,,,,,6
top,Sri Lanka,,,,,Africa
freq,1,,,,,53
mean,,106.160622,80.994819,49.450777,4.717098,
std,,101.143103,88.284312,79.697598,3.773298,
min,,0.0,0.0,0.0,0.0,
25%,,20.0,4.0,1.0,1.3,
50%,,76.0,56.0,8.0,4.2,
75%,,188.0,128.0,59.0,7.2,


In [32]:
## Will give details of only selected data types
drinks.describe(include = ['object', 'float64'])

Unnamed: 0,country,total_litres_of_pure_alcohol,continent
count,193,193.0,193
unique,193,,6
top,Sri Lanka,,Africa
freq,1,,53
mean,,4.717098,
std,,3.773298,
min,,0.0,
25%,,1.3,
50%,,4.2,
75%,,7.2,
