---
<center><h1>Basic intro into pandas</h1></center> 

<center><h2>Working with pandas DataFrames: main operations, sorting and selecting by type</h2></center>

---

## Table of Contents
- [Work with pandas DataFrames: main operations, sorting and selecting by type](#Work-with-pandas-DataFrames:-main-operations,-sorting-and-selecting-by-type)
    * [Flexible comparisons and boolean reductions](#Flexible-comparisons-and-boolean-reductions)
    * [Descriptive statistics](#Descriptive-statistics)
    * [Function application](#Function-application)
    * [Sorting](#Sorting)
    * [Selecting by type](#Selecting-by-type)

In [92]:
import pandas as pd
import numpy as np
import random

## Work with pandas DataFrames: main operations, sorting and selecting by type

[[back to top]](#Table-of-Contents)

In this part we will consider the following questions:
1.	how to quickly compare two or more DataFrames, or check whether a DataFrame's items satisfy a condition;
2.	which mathematical (computational) and statistical operations can easily be applied to a DataFrame's data, i.e. which such operations are built into pandas;
3.	how to apply an arbitrary function to a DataFrame's items, rows, columns, or the whole DataFrame, and how to change its data type;
4.	how to sort row and column data;
5.	how to select columns by their type.

First, let's find all unique values in the `'Province_State'` column of the COVID-19 dataset.

In [93]:
from arcgis.features import GeoAccessor, GeoSeriesAccessor
#Import a COVID data layer.  This layer contains the updated stats for each county in the United States
from arcgis.features import FeatureLayer
mylayer = FeatureLayer(("https://services1.arcgis.com/0MSEUqKaxRlEPj5g/ArcGIS/rest/services/ncov_cases_US/FeatureServer/0"))
sdf2 = pd.DataFrame.spatial.from_layer(mylayer)
sdf2.head(4)

Unnamed: 0,OBJECTID,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Recovered,Deaths,Active,Admin2,FIPS,Combined_Key,Incident_Rate,People_Tested,People_Hospitalized,UID,ISO3,SHAPE
0,1,Alabama,US,2023-03-10 13:21:02,32.539527,-86.644082,19790,,232,,Autauga,1001,"Autauga, Alabama, US",35422.14824,,,84001001,USA,"{""x"": -86.64408226999996, ""y"": 32.539527450000..."
1,2,Alabama,US,2023-03-10 13:21:02,30.72775,-87.722071,69860,,727,,Baldwin,1003,"Baldwin, Alabama, US",31294.516068,,,84001003,USA,"{""x"": -87.72207057999998, ""y"": 30.727749910000..."
2,3,Alabama,US,2023-03-10 13:21:02,31.868263,-85.387129,7485,,103,,Barbour,1005,"Barbour, Alabama, US",30320.82962,,,84001005,USA,"{""x"": -85.38712859999998, ""y"": 31.868263000000..."
3,4,Alabama,US,2023-03-10 13:21:02,32.996421,-87.125115,8091,,109,,Bibb,1007,"Bibb, Alabama, US",36130.21345,,,84001007,USA,"{""x"": -87.12511459999996, ""y"": 32.996420640000..."


In [94]:
sdf2.dtypes

OBJECTID                        Int64
Province_State         string[python]
Country_Region         string[python]
Last_Update            datetime64[ns]
Lat                           Float64
Long_                         Float64
Confirmed                       Int32
Recovered                       Int32
Deaths                          Int32
Active                          Int32
Admin2                 string[python]
FIPS                   string[python]
Combined_Key           string[python]
Incident_Rate                 Float64
People_Tested                   Int32
People_Hospitalized             Int32
UID                             Int32
ISO3                   string[python]
SHAPE                        geometry
dtype: object

In [95]:
# get unique values
unique_states = sdf2['Province_State'].drop_duplicates().dropna()
unique_states

0                    Alabama
69                    Alaska
100                  Arizona
117                 Arkansas
193               California
253                 Colorado
319              Connecticut
328                 Delaware
332     District of Columbia
333                  Florida
401                  Georgia
562                   Hawaii
568                    Idaho
613                 Illinois
717                  Indiana
810                     Iowa
910                   Kansas
1017                Kentucky
1138               Louisiana
1204                   Maine
1222                Maryland
1247           Massachusetts
1261                Michigan
1348               Minnesota
1436             Mississippi
1519                Missouri
1636                 Montana
1693                Nebraska
1787                  Nevada
1805           New Hampshire
1816              New Jersey
1838              New Mexico
1872                New York
1936          North Carolina
2037          

Above we used the `drop_duplicates()` method to keep only the unique values of the Series.
Below we filter the same DataFrame down to specific states using `isin()`.

In [96]:
SouthCentraldf = sdf2[sdf2['Province_State'].isin(["Mississippi", "Alabama", "Louisiana"])]
SouthCentraldf

Unnamed: 0,OBJECTID,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Recovered,Deaths,Active,Admin2,FIPS,Combined_Key,Incident_Rate,People_Tested,People_Hospitalized,UID,ISO3,SHAPE
0,1,Alabama,US,2023-03-10 13:21:02,32.539527,-86.644082,19790,,232,,Autauga,01001,"Autauga, Alabama, US",35422.14824,,,84001001,USA,"{""x"": -86.64408226999996, ""y"": 32.539527450000..."
1,2,Alabama,US,2023-03-10 13:21:02,30.72775,-87.722071,69860,,727,,Baldwin,01003,"Baldwin, Alabama, US",31294.516068,,,84001003,USA,"{""x"": -87.72207057999998, ""y"": 30.727749910000..."
2,3,Alabama,US,2023-03-10 13:21:02,31.868263,-85.387129,7485,,103,,Barbour,01005,"Barbour, Alabama, US",30320.82962,,,84001005,USA,"{""x"": -85.38712859999998, ""y"": 31.868263000000..."
3,4,Alabama,US,2023-03-10 13:21:02,32.996421,-87.125115,8091,,109,,Bibb,01007,"Bibb, Alabama, US",36130.21345,,,84001007,USA,"{""x"": -87.12511459999996, ""y"": 32.996420640000..."
4,5,Alabama,US,2023-03-10 13:21:02,33.982109,-86.567906,18704,,261,,Blount,01009,"Blount, Alabama, US",32345.311797,,,84001009,USA,"{""x"": -86.56790592999994, ""y"": 33.982109180000..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1514,1515,Mississippi,US,2023-03-10 13:21:02,33.613005,-89.283929,4069,,76,,Webster,28155,"Webster, Mississippi, US",41996.078027,,,84028155,USA,"{""x"": -89.28392911999998, ""y"": 33.613004860000..."
1515,1516,Mississippi,US,2023-03-10 13:21:02,31.160782,-91.310188,1943,,47,,Wilkinson,28157,"Wilkinson, Mississippi, US",22514.484357,,,84028157,USA,"{""x"": -91.31018818999996, ""y"": 31.160782250000..."
1516,1517,Mississippi,US,2023-03-10 13:21:02,33.087479,-89.033914,6747,,105,,Winston,28159,"Winston, Mississippi, US",37577.276525,,,84028159,USA,"{""x"": -89.03391384999998, ""y"": 33.087479080000..."
1517,1518,Mississippi,US,2023-03-10 13:21:02,34.028242,-89.707621,4831,,63,,Yalobusha,28161,"Yalobusha, Mississippi, US",39899.240172,,,84028161,USA,"{""x"": -89.70762049999996, ""y"": 34.028241750000..."


Now we may filter the resulting DataFrame `SouthCentraldf` by the `Deaths` column, keeping only the counties with fewer than 50 deaths.

In [97]:
lowDeathsSouthCentraldf = SouthCentraldf[SouthCentraldf['Deaths'] < 50]
lowDeathsSouthCentraldf

Unnamed: 0,OBJECTID,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Recovered,Deaths,Active,Admin2,FIPS,Combined_Key,Incident_Rate,People_Tested,People_Hospitalized,UID,ISO3,SHAPE
11,12,Alabama,US,2023-03-10 13:21:02,32.022273,-88.265644,2259,,39,,Choctaw,1023,"Choctaw, Alabama, US",17944.237032,,,84001023,USA,"{""x"": -88.26564429999996, ""y"": 32.022273410000..."
52,53,Alabama,US,2020-12-21 13:27:30,,,0,,0,,Out of AL,80001,"Out of AL, Alabama, US",,,,84080001,USA,
53,54,Alabama,US,2023-03-10 13:21:02,32.640483,-87.297706,2673,,49,,Perry,1105,"Perry, Alabama, US",29956.292727,,,84001105,USA,"{""x"": -87.29770588999997, ""y"": 32.640483410000..."
64,65,Alabama,US,2020-12-21 13:27:30,,,0,,0,,Unassigned,90001,"Unassigned, Alabama, US",,,,84090001,USA,
67,68,Alabama,US,2023-03-10 13:21:02,31.987732,-87.308911,3659,,48,,Wilcox,1131,"Wilcox, Alabama, US",35274.269739,,,84001131,USA,"{""x"": -87.30891117999994, ""y"": 31.987732150000..."
1149,1150,Louisiana,US,2023-03-10 13:21:02,29.875922,-93.193107,1387,,10,,Cameron,22023,"Cameron, Louisiana, US",19891.008174,,,84022023,USA,"{""x"": -93.19310675999998, ""y"": 29.875922380000..."
1155,1156,Louisiana,US,2023-03-10 13:21:02,32.73927,-91.234257,3349,,39,,East Carroll,22035,"East Carroll, Louisiana, US",48812.126512,,,84022035,USA,"{""x"": -91.23425693999997, ""y"": 32.739269540000..."
1175,1176,Louisiana,US,2022-09-12 23:21:04,,,0,,0,,Out of LA,80022,"Out of LA, Louisiana, US",,,,84080022,USA,
1176,1177,Louisiana,US,2023-03-10 13:21:02,29.422454,-89.603221,7939,,49,,Plaquemines,22075,"Plaquemines, Louisiana, US",34224.253136,,,84022075,USA,"{""x"": -89.60322084999996, ""y"": 29.422454470000..."
1184,1185,Louisiana,US,2023-03-10 13:21:02,30.822103,-90.710132,2276,,27,,St. Helena,22091,"St. Helena, Louisiana, US",22463.482037,,,84022091,USA,"{""x"": -90.71013175999997, ""y"": 30.822103240000..."


We are going to take a break from real data and show some examples using made-up data. These examples compare different DataFrames (or Series).

In [98]:
df_ABC = pd.DataFrame({'A': [1,2,3], 'B': [3,4,5], 'C': [-1,9,-4]})
df_ABC

Unnamed: 0,A,B,C
0,1,3,-1
1,2,4,9
2,3,5,-4


In [99]:
df_ACD = pd.DataFrame({'A': [0,4,9], 'C': [-1,-3,-2], 'D': [0,1,-2]})
df_ACD

Unnamed: 0,A,C,D
0,0,-1,0
1,4,-3,1
2,9,-2,-2


In [100]:
df_ABC.le(df_ACD)

Unnamed: 0,A,B,C,D
0,False,False,True,False
1,True,False,False,False
2,True,False,True,False


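As mentioned above, pandas compares elements that share the same row and column labels. Where a label exists in only one of the two DataFrames (columns `B` and `D` here), the comparison returns `False`.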
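
A minimal sketch of how the flexible comparison methods align their operands, using two small made-up frames (the names `left` and `right` are illustrative only). Labels present in only one frame are filled with missing values, and a comparison against a missing value yields `False`, except for `ne`, which yields `True`.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})
right = pd.DataFrame({'A': [0, 4, 9], 'D': [0, 1, -2]})

# Operands are aligned on the union of row and column labels before comparing.
le_result = left.le(right)   # "less than or equal"
ne_result = left.ne(right)   # "not equal"

print(le_result)
print(ne_result)
```

The other flexible methods (`eq`, `lt`, `gt`, `ge`) follow the same alignment rule.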

You can also apply the reductions `empty`, `any()`, `all()`, and `bool()` to summarize a boolean result:

In [101]:
# compare down each column: check whether all items of a column satisfy the condition
(df_ACD < 0).all()

A    False
C     True
D    False
dtype: bool

In [102]:
# compare across each row (axis=1): check whether all items of a row satisfy the condition
(df_ACD < 0).all(axis=1)

0    False
1    False
2    False
dtype: bool

In [103]:
# compare down each column: check whether at least one item
# of a column satisfies the condition
(df_ACD < 0).any()

A    False
C     True
D     True
dtype: bool

In [104]:
# here we check whether any item of the DataFrame satisfies the condition
(df_ACD < 0).any().any()

True

In [105]:
# here we check if DataFrame is empty (no elements)
df_ACD.empty

False

Using the approach shown above, you can identify the columns or rows that satisfy a condition. This is helpful when you need to check quickly whether a DataFrame, or some of its rows or columns, contains, for instance, only positive values, without caring which elements those are; this is the main difference between filtering and flexible comparisons. Remember that you can negate a boolean mask element-wise with the `~` operator; the Python `not` keyword raises a `ValueError` when applied to a DataFrame or Series.
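
A minimal sketch of negating a boolean mask and of using a reduction to pick columns or rows. The frame repeats the made-up `df_ACD` values; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 4, 9], 'C': [-1, -3, -2], 'D': [0, 1, -2]})

negative = df < 0           # element-wise boolean mask
non_negative = ~negative    # element-wise negation; `not negative` would raise ValueError

# Keep only the columns whose values are all negative.
all_negative_cols = df.loc[:, negative.all()]

# Keep only the rows that contain at least one negative value.
rows_with_negative = df[negative.any(axis=1)]

print(all_negative_cols)
print(rows_with_negative)
```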

### Descriptive statistics

[[back to top]](#Table-of-Contents)

pandas provides a large number of methods for computing descriptive statistics and other related mathematical operations on Series and DataFrame. Most of these are aggregations, but some of them produce an object of the same size. The most common ones are collected in the summary table below:

|Function|Description|
|--|-------------------------------|
|abs|absolute value|
|count|number of non-null observations|
|sum|sum of values|
|mean|mean of values|
|mad|mean absolute deviation (removed in pandas 2.0)|
|median|arithmetic median of values|
|min|minimum value|
|max|maximum value|
|idxmin|index label of minimum value|
|idxmax|index label of maximum value|
|mode|mode|
|prod|product of values|
|std|unbiased standard deviation|
|var|unbiased variance|
|cumsum|cumulative sum (a sequence of partial sums of a given sequence)|
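
A short sketch of a few of these methods on a small made-up DataFrame (the name `scores` and its values are illustrative only). It shows which calls aggregate and which return an object of the same size, and computes the mean absolute deviation by hand, since `mad` is not available in pandas 2.x.

```python
import pandas as pd

scores = pd.DataFrame({'x': [1, -2, 3, 4], 'y': [10.0, None, 30.0, 20.0]})

print(scores.count())         # aggregation: non-null observations per column
print(scores.sum())           # aggregation: missing values are skipped by default
print(scores['x'].abs())      # same size as the input
print(scores['x'].cumsum())   # same size as the input
print(scores['y'].idxmax())   # index label of the largest value

# Mean absolute deviation around the mean, computed manually.
x = scores['x']
mad_x = (x - x.mean()).abs().mean()
print(mad_x)
```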

Let’s demonstrate how you can use these methods:

In [106]:
sdf2['Deaths'].sum()

1123208

In [107]:
sdf2['Deaths'].mean()

343.278728606357

In [108]:
# without numeric_only=True, recent pandas versions raise a TypeError on non-numeric columns
sdf2.mean()

TypeError: Cannot perform reduction 'mean' with string dtype

In [111]:
# average value of each numeric column
sdf2.mean(numeric_only=True)

OBJECTID                        1636.5
Lat                           37.97145
Long_                       -91.575549
Confirmed                 31692.011308
Recovered                         <NA>
Deaths                      343.278729
Active                            <NA>
Incident_Rate             30629.390382
People_Tested                     <NA>
People_Hospitalized               <NA>
UID                    83519075.055623
dtype: object

In [112]:
sdf2['Deaths'].max(), sdf2['Deaths'].idxmax()

(35545, 211)
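
Note that `idxmax()` returns an index label, not an integer position. The two coincide only for a default `RangeIndex`. A minimal sketch with a made-up Series and a non-default index shows the difference between label-based and position-based lookup:

```python
import pandas as pd

s = pd.Series([5, 42, 7], index=[10, 20, 30])

label = s.idxmax()      # index label of the maximum
position = s.argmax()   # integer position of the maximum

# Use .loc with labels and .iloc with positions.
print(label, s.loc[label])
print(position, s.iloc[position])
```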

In [113]:
# Which county has had the most deaths? Select the row by the label that idxmax() returns
sdf2.loc[sdf2['Deaths'].idxmax()]

OBJECTID                                                             212
Province_State                                                California
Country_Region                                                        US
Last_Update                                          2023-03-10 13:21:02
Lat                                                            34.308284
Long_                                                        -118.228241
Confirmed                                                        3710586
Recovered                                                           <NA>
Deaths                                                             35545
Active                                                              <NA>
Admin2                                                       Los Angeles
FIPS                                                               06037
Combined_Key                                 Los Angeles, California, US
Incident_Rate                                      

You can also apply your own function, defined beforehand, using the `apply` method:

In [118]:
def my_own_func(x, power, delta=0):
    if x < 20:
        return (x - delta)**power
    elif x >= 20:
        return round(power/x, 2)
    else:
        return  np.nan
    
sdf2['Deaths'].apply(my_own_func, args=(2,), delta=1).head(10)

0    0.01
1    0.00
2    0.02
3    0.02
4    0.01
5    0.04
6    0.02
7    0.00
8    0.01
9    0.02
Name: Deaths, dtype: float64

Here the first argument of the `apply` method is the function itself, `args` is a `tuple` of additional positional arguments passed to the function after each element, and any remaining keyword arguments (such as `delta`) are forwarded to the function as well.
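
A minimal sketch of how `apply` forwards extra arguments, using a hypothetical helper `scale_and_shift` and made-up values:

```python
import pandas as pd

def scale_and_shift(x, factor, shift=0):
    """Multiply a single element by `factor`, then add `shift`."""
    return x * factor + shift

values = pd.Series([1, 2, 3])

# Each element is passed as `x`; `args` supplies `factor`, the keyword supplies `shift`.
result = values.apply(scale_and_shift, args=(10,), shift=1)
print(result.tolist())
```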

To apply a function to each element of a Series (for example, a row or column of a DataFrame) you may use the `map` method. (Look at the type of the `'Deaths'` column shown earlier; do you remember how to check it?)

In [119]:
# get 'Deaths' column where NaN replaced by 0
sdf2['Deaths'].map(lambda x: int(x) if pd.notnull(x) else 0).head(10)

0    232
1    727
2    103
3    109
4    261
5     54
6    132
7    680
8    172
9     89
Name: Deaths, dtype: int64

In [120]:
sdf2['Deaths'].fillna(0).astype(int).head(10)

0    232
1    727
2    103
3    109
4    261
5     54
6    132
7    680
8    172
9     89
Name: Deaths, dtype: int32

Here we used the `astype()` method to change the type of the column's elements. But why did we write `fillna(0)` first?
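
One way to explore that question is the sketch below, which uses a made-up float Series rather than the real `Deaths` column. It assumes the usual pandas behavior that a missing value cannot be cast to a plain NumPy integer, and shows the nullable `Int64` dtype as an alternative that keeps the gap.

```python
import numpy as np
import pandas as pd

s = pd.Series([232.0, np.nan, 103.0])

# Casting directly fails because NaN has no integer representation.
try:
    s.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

filled = s.fillna(0).astype(int)   # replace the gap first, then cast
nullable = s.astype('Int64')       # or keep the gap as <NA> in a nullable integer dtype

print(cast_failed)
print(filled.tolist())
print(nullable)
```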

### Sorting

[[back to top]](#Table-of-Contents)

pandas offers two kinds of very fast sorting: sorting by label with `sort_index()`, and sorting by actual values with `sort_values()`, which is available for both Series and DataFrame. Note that both methods return a new object by default; pass `inplace=True` to modify the object in place instead. When applying `sort_values()` to a DataFrame, you must pass a column name or a list of column names via `by` to determine the sort order. By default pandas sorts in ascending order. To sort in descending order, pass `ascending=False`.
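
A minimal sketch, with made-up data, of sorting by label and by value in current pandas versions, where value sorting is done with `sort_values()`. It also illustrates that both methods leave the original object untouched unless `inplace=True` is passed.

```python
import pandas as pd

df = pd.DataFrame({'name': ['b', 'c', 'a'], 'value': [2, 3, 1]}, index=[2, 0, 1])

by_label = df.sort_index()                            # new object, rows ordered by index label
by_value = df.sort_values('value', ascending=False)   # new object, largest value first

# The original frame keeps its order until inplace=True is used.
original_order = list(df.index)
df.sort_values('value', inplace=True)

print(by_label)
print(by_value)
print(df)
```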


In [121]:
sdf2.sort_index().head(10)

Unnamed: 0,OBJECTID,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Recovered,Deaths,Active,Admin2,FIPS,Combined_Key,Incident_Rate,People_Tested,People_Hospitalized,UID,ISO3,SHAPE
0,1,Alabama,US,2023-03-10 13:21:02,32.539527,-86.644082,19790,,232,,Autauga,1001,"Autauga, Alabama, US",35422.14824,,,84001001,USA,"{""x"": -86.64408226999996, ""y"": 32.539527450000..."
1,2,Alabama,US,2023-03-10 13:21:02,30.72775,-87.722071,69860,,727,,Baldwin,1003,"Baldwin, Alabama, US",31294.516068,,,84001003,USA,"{""x"": -87.72207057999998, ""y"": 30.727749910000..."
2,3,Alabama,US,2023-03-10 13:21:02,31.868263,-85.387129,7485,,103,,Barbour,1005,"Barbour, Alabama, US",30320.82962,,,84001005,USA,"{""x"": -85.38712859999998, ""y"": 31.868263000000..."
3,4,Alabama,US,2023-03-10 13:21:02,32.996421,-87.125115,8091,,109,,Bibb,1007,"Bibb, Alabama, US",36130.21345,,,84001007,USA,"{""x"": -87.12511459999996, ""y"": 32.996420640000..."
4,5,Alabama,US,2023-03-10 13:21:02,33.982109,-86.567906,18704,,261,,Blount,1009,"Blount, Alabama, US",32345.311797,,,84001009,USA,"{""x"": -86.56790592999994, ""y"": 33.982109180000..."
5,6,Alabama,US,2023-03-10 13:21:02,32.100305,-85.712655,3030,,54,,Bullock,1011,"Bullock, Alabama, US",29997.029997,,,84001011,USA,"{""x"": -85.71265534999998, ""y"": 32.100305330000..."
6,7,Alabama,US,2023-03-10 13:21:02,31.753001,-86.680575,6551,,132,,Butler,1013,"Butler, Alabama, US",33684.697655,,,84001013,USA,"{""x"": -86.68057477999997, ""y"": 31.753000950000..."
7,8,Alabama,US,2023-03-10 13:21:02,33.774837,-85.826304,41421,,680,,Calhoun,1015,"Calhoun, Alabama, US",36460.54311,,,84001015,USA,"{""x"": -85.82630385999994, ""y"": 33.774837270000..."
8,9,Alabama,US,2023-03-10 13:21:02,32.913601,-85.390727,10859,,172,,Chambers,1017,"Chambers, Alabama, US",32654.718229,,,84001017,USA,"{""x"": -85.39072748999996, ""y"": 32.913600790000..."
9,10,Alabama,US,2023-03-10 13:21:02,34.17806,-85.60639,6755,,89,,Cherokee,1019,"Cherokee, Alabama, US",25786.3796,,,84001019,USA,"{""x"": -85.60638967999995, ""y"": 34.178059830000..."


In [122]:
sdf2.sort_index(axis=1).sort_index(ascending=False).head(10)

Unnamed: 0,Active,Admin2,Combined_Key,Confirmed,Country_Region,Deaths,FIPS,ISO3,Incident_Rate,Last_Update,Lat,Long_,OBJECTID,People_Hospitalized,People_Tested,Province_State,Recovered,SHAPE,UID
3271,,Weston,"Weston, Wyoming, US",1905,US,23,56045,USA,27501.08272,2023-03-10 13:21:02,43.839612,-104.567488,3272,,,Wyoming,,"{""x"": -104.56748809999999, ""y"": 43.83961191000...",84056045
3270,,Washakie,"Washakie, Wyoming, US",2755,US,51,56043,USA,35297.885971,2023-03-10 13:21:02,43.904516,-107.680187,3271,,,Wyoming,,"{""x"": -107.68018699999999, ""y"": 43.90451606000...",84056043
3269,,Unassigned,"Unassigned, Wyoming, US",0,US,0,90056,USA,,2023-01-08 23:21:00,,,3270,,,Wyoming,,,84090056
3268,,Uinta,"Uinta, Wyoming, US",6406,US,43,56041,USA,31672.105211,2023-03-10 13:21:02,41.287818,-110.547578,3269,,,Wyoming,,"{""x"": -110.54757819999998, ""y"": 41.28781830000...",84056041
3267,,Teton,"Teton, Wyoming, US",12134,US,16,56039,USA,51713.262871,2023-03-10 13:21:02,43.935225,-110.58908,3268,,,Wyoming,,"{""x"": -110.58908009999999, ""y"": 43.93522482000...",84056039
3266,,Sweetwater,"Sweetwater, Wyoming, US",12507,US,139,56037,USA,29537.349739,2023-03-10 13:21:02,41.659439,-108.882788,3267,,,Wyoming,,"{""x"": -108.8827882, ""y"": 41.659438960000045, ""...",84056037
3265,,Sublette,"Sublette, Wyoming, US",2316,US,28,56035,USA,23558.132438,2023-03-10 13:21:02,42.765583,-109.913092,3266,,,Wyoming,,"{""x"": -109.91309219999994, ""y"": 42.76558279000...",84056035
3264,,Sheridan,"Sheridan, Wyoming, US",10008,US,90,56033,USA,32829.260292,2023-03-10 13:21:02,44.790489,-106.886239,3265,,,Wyoming,,"{""x"": -106.88623889999997, ""y"": 44.79048913000...",84056033
3263,,Platte,"Platte, Wyoming, US",2299,US,44,56031,USA,27391.874181,2023-03-10 13:21:02,42.132991,-104.966331,3264,,,Wyoming,,"{""x"": -104.96633099999997, ""y"": 42.13299116000...",84056031
3262,,Park,"Park, Wyoming, US",7713,US,153,56029,USA,26419.81229,2023-03-10 13:21:02,44.521575,-109.585283,3263,,,Wyoming,,"{""x"": -109.58528249999995, ""y"": 44.52157546000...",84056029


In [123]:
sdf2.sort_values(['Deaths', 'Active'], ascending=[False, False], na_position='first').head(20)

Unnamed: 0,OBJECTID,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Recovered,Deaths,Active,Admin2,FIPS,Combined_Key,Incident_Rate,People_Tested,People_Hospitalized,UID,ISO3,SHAPE
211,212,California,US,2023-03-10 13:21:02,34.308284,-118.228241,3710586,,35545,,Los Angeles,6037,"Los Angeles, California, US",36961.315384,,,84006037,USA,"{""x"": -118.22824109999999, ""y"": 34.30828379000..."
107,108,Arizona,US,2023-03-10 13:21:02,33.348359,-112.491815,1530296,,18846,,Maricopa,4013,"Maricopa, Arizona, US",34117.162875,,,84004013,USA,"{""x"": -112.49181539999995, ""y"": 33.34835867000..."
628,629,Illinois,US,2023-03-10 13:21:02,41.841448,-87.816588,1533935,,15289,,Cook,17031,"Cook, Illinois, US",29783.798131,,,84017031,USA,"{""x"": -87.81658793999998, ""y"": 41.841448490000..."
1895,1896,New York,US,2023-03-10 13:21:02,40.636182,-73.949356,963672,,14205,,Kings,36047,"Kings, New York, US",37644.863887,,,84036047,USA,"{""x"": -73.94935551999998, ""y"": 40.636182500000..."
1913,1914,New York,US,2023-03-10 13:21:02,40.710881,-73.816847,904779,,13415,,Queens,36081,"Queens, New York, US",40143.567164,,,84036081,USA,"{""x"": -73.81684711999998, ""y"": 40.710881240000..."
375,376,Florida,US,2023-03-10 13:21:02,25.611236,-80.551706,1552197,,12285,,Miami-Dade,12086,"Miami-Dade, Florida, US",57130.337807,,,84012086,USA,"{""x"": -80.55170586999998, ""y"": 25.611236200000..."
2761,2762,Texas,US,2023-03-10 13:21:02,29.858649,-95.393395,1275028,,11623,,Harris,48201,"Harris, Texas, US",27051.561265,,,84048201,USA,"{""x"": -95.39339520999994, ""y"": 29.858649390000..."
1789,1790,Nevada,US,2023-03-10 13:21:02,36.214589,-115.013024,671243,,9313,,Clark,32003,"Clark, Nevada, US",29613.03031,,,84032003,USA,"{""x"": -115.01302409999994, ""y"": 36.21458855000..."
1346,1347,Michigan,US,2023-03-10 13:21:02,42.280984,-83.281255,537017,,9107,,Wayne,26163,"Wayne, Michigan, US",30698.2107,,,84026163,USA,"{""x"": -83.28125499999999, ""y"": 42.280984050000..."
1874,1875,New York,US,2023-03-10 13:21:02,40.852093,-73.862828,553117,,8529,,Bronx,36005,"Bronx, New York, US",39001.147223,,,84036005,USA,"{""x"": -73.86282754999996, ""y"": 40.852093010000..."


Here the first argument is a `list` of the DataFrame's columns to sort by, the second gives the sort order for each corresponding column, and the last defines where null values are placed.
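
The same call pattern on a small made-up frame (column names `group` and `score` are illustrative only) makes the effect of each argument easier to see: one sort direction per column, and missing values moved to the front.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': ['x', 'y', 'x', 'y'],
                   'score': [1.0, np.nan, 3.0, 2.0]})

# Sort by group ascending, then by score descending, with missing scores first.
result = df.sort_values(['group', 'score'],
                        ascending=[True, False],
                        na_position='first')
print(result)
```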

And let’s give an example of Series sorting:

In [124]:
sdf2['Combined_Key'].sort_values()

2450    Abbeville, South Carolina, US
1138            Acadia, Louisiana, US
2945           Accomack, Virginia, US
568                    Ada, Idaho, US
810                   Adair, Iowa, US
                    ...              
116                 Yuma, Arizona, US
318                Yuma, Colorado, US
2914                Zapata, Texas, US
2915                Zavala, Texas, US
2563        Ziebach, South Dakota, US
Name: Combined_Key, Length: 3272, dtype: string

Note that pandas versions before 0.17.0 used other methods for sorting by values: `order()` for Series and `sort()` for DataFrame. Both have since been removed in favor of `sort_values()`.

It is important to note that Series has the `nsmallest()` and `nlargest()` methods which return the smallest or largest `n` values. For a large Series this can be much faster than sorting the entire Series and calling `head(n)` on the result.
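
A minimal sketch, on a made-up Series, showing that `nlargest(n)` gives the same result as sorting in descending order and taking `head(n)`:

```python
import pandas as pd

s = pd.Series([5, 1, 9, 3, 7], index=list('abcde'))

top3 = s.nlargest(3)
top3_sorted = s.sort_values(ascending=False).head(3)   # same result via a full sort
bottom2 = s.nsmallest(2)

print(top3)
print(bottom2)
```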


In [125]:
sdf2['Deaths'].nlargest(3)

211    35545
107    18846
628    15289
Name: Deaths, dtype: Int32

In [126]:
sdf2['Deaths'].nsmallest(5)

52    0
64    0
73    0
93    0
95    0
Name: Deaths, dtype: Int32

### Selecting by type

[[back to top]](#Table-of-Contents)

You already know how to see the type of each column of a DataFrame (with `dtypes`, for example) and how to change the type of a DataFrame's column (with the `astype()` method). But what if you need to select all columns of a certain type? The `select_dtypes()` method makes this very easy. Let's create a DataFrame with data of many different types to demonstrate how it works, rather than use one of the provided datasets.


In [127]:
import datetime
types_df = pd.DataFrame({  'int': list(range(3)),
                           'float': [1.1, 2.2, 3.3],
                           'bool': [False, True, False],
                           'string': list('abc'),
                           'undefined': [2>1, pd.isnull(np.inf),isinstance([],list)],
                           'shuffled': [datetime.datetime.now(), [np.nan, np.inf], type('A')],
                           'date': pd.date_range('20151120', periods=3).values
                        })
types_df

Unnamed: 0,int,float,bool,string,undefined,shuffled,date
0,0,1.1,False,a,True,2025-04-16 12:47:49.165862,2015-11-20
1,1,2.2,True,b,False,"[nan, inf]",2015-11-21
2,2,3.3,False,c,True,<class 'str'>,2015-11-22


In [128]:
types_df.dtypes

int                   int64
float               float64
bool                   bool
string               object
undefined              bool
shuffled             object
date         datetime64[ns]
dtype: object

Note that pandas stores the Python type `str` as dtype `object` here.

Let's select only the boolean columns:


In [129]:
types_df.select_dtypes(include=['bool'])   
# or types_df.select_dtypes(include=[bool])

Unnamed: 0,bool,undefined
0,False,True
1,True,False
2,False,True


or keep all columns whose types are neither `bool` nor `object`:

In [130]:
types_df.select_dtypes(exclude=['bool', 'object']) 
# or types_df.select_dtypes(include=['datetime64[ns]','float64', 'int64'])


Unnamed: 0,int,float,date
0,0,1.1,2015-11-20
1,1,2.2,2015-11-21
2,2,3.3,2015-11-22
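
`select_dtypes()` also accepts the generic `'number'` selector, which matches all numeric dtypes at once. A minimal sketch on a made-up frame (column names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'int': [0, 1, 2],
                   'float': [1.1, 2.2, 3.3],
                   'bool': [False, True, False],
                   'text': list('abc')})

numeric = df.select_dtypes(include='number')       # integer and float columns; bool is not included
numeric_cols = numeric.columns.tolist()
non_numeric = df.select_dtypes(exclude='number')   # everything else

print(numeric_cols)
print(non_numeric.columns.tolist())
```

This is a convenient way to restrict aggregations such as `mean()` to numeric columns.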
