# Week 14
#### How to:
* Read .csv data from an external file by using the Pandas library
* Read a table from a web page by using the Pandas library
* Load a SAS data set into a Pandas dataframe by using the SASPy module
* Load a Pandas dataframe into a SAS data set
* Generate SAS code by using the SASPy module
* Generate a profile of the Python data object created from the SAS data set
* Use various file management functions


## pd.read_csv() reads a raw data file and produces a dataframe.
### Attributes
* Use df.dtypes to find out the data type of each column of the data frame named df

In [None]:
import pandas as pd
df=pd.read_csv('/Data/TV_Data_noheader.csv', names=['opinion', 'party', 'income', 'age'])
#print(df)
type(df)


In [26]:
df.dtypes

opinion    object
party      object
income      int64
age         int64
dtype: object

In [27]:
df.head()

Unnamed: 0,opinion,party,income,age
0,Agree,Republican,35000,30
1,Disagree,Democrat,40000,50
2,Strongly Agree,Independent,100000,40
3,No Opinion,Green Party,90000,45
4,Strongly Disagree,Democrat,65000,45
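As a minimal sketch of the `names=` behaviour shown above, the following reads a headerless CSV from an in-memory string rather than from `/Data/TV_Data_noheader.csv`. The two sample rows are copied from the output above, and `csv_text` is a name introduced here for illustration only.

```python
import io

import pandas as pd

# Two rows copied from the df.head() output above; the real file is not reproduced here.
csv_text = "Agree,Republican,35000,30\nDisagree,Democrat,40000,50\n"

# With names=..., pandas treats the first line as data and applies the supplied column labels.
df = pd.read_csv(io.StringIO(csv_text), names=['opinion', 'party', 'income', 'age'])

print(df.shape)
print(df.dtypes)
```

Without `names=`, pandas would consume the first data row as the header, so a headerless file would silently lose one record.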


## pd.read_html() imports the tables on a web page and returns a list of dataframes, not a single dataframe.

In [10]:
import pandas as pd
url = "https://simple.wikipedia.org/wiki/List_of_U.S._states"
data = pd.read_html(url, index_col=0)
type(data)


list
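Because the result is a list, the individual tables have to be picked out by position. The sketch below illustrates this with a small hypothetical HTML snippet written inline (so no network access is needed) instead of the Wikipedia page. It assumes an HTML parser that pandas can use, such as lxml, is installed.

```python
import io

import pandas as pd

# Hypothetical two-table page, written inline for illustration.
html = """
<table>
  <tr><th>State</th><th>Capital</th></tr>
  <tr><td>Maryland</td><td>Annapolis</td></tr>
  <tr><td>Virginia</td><td>Richmond</td></tr>
</table>
<table>
  <tr><th>Code</th><th>State</th></tr>
  <tr><td>MD</td><td>Maryland</td></tr>
</table>
"""

# read_html returns one dataframe per <table> element it finds.
tables = pd.read_html(io.StringIO(html))
print(type(tables), len(tables))

# Select the table you need by its position in the list.
states = tables[0]
print(states)
```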

### Accessing a SAS data set from Python by using the SASPy module (SASdata object)

In [6]:
import saspy
import pandas as pd
sas = saspy.SASsession(cfgname='winlocal')
iris = sas.sasdata("iris","SASHELP")
iris.describe()

SAS Connection established. Subprocess id is 33804



Unnamed: 0,Variable,Label,N,NMiss,Median,Mean,StdDev,Min,P25,P50,P75,Max
0,SepalLength,Sepal Length (mm),150,0,58.0,58.433333,8.280661,43,51,58.0,64,79
1,SepalWidth,Sepal Width (mm),150,0,30.0,30.573333,4.358663,20,28,30.0,33,44
2,PetalLength,Petal Length (mm),150,0,43.5,37.58,17.652982,10,16,43.5,51,69
3,PetalWidth,Petal Width (mm),150,0,13.0,11.993333,7.622377,1,3,13.0,18,25


### Running SAS programs in Python notebooks by using a JupyterLab magic command (%%SAS)

In [3]:
import saspy

In [4]:
%%SAS
proc means data=sashelp.iris; run;

Using SAS Config named: winlocal
SAS Connection established. Subprocess id is 41296



Variable,Label,N,Mean,Std Dev,Minimum,Maximum
SepalLength SepalWidth PetalLength PetalWidth,Sepal Length (mm) Sepal Width (mm) Petal Length (mm) Petal Width (mm),150 150 150 150,58.4333333 30.5733333 37.5800000 11.9933333,8.2806613 4.3586628 17.6529823 7.6223767,43.0000000 20.0000000 10.0000000 1.0000000,79.0000000 44.0000000 69.0000000 25.0000000


### Loading a SAS data set into a Pandas dataframe by using the SASPy module

In [3]:
import saspy
sas = saspy.SASsession(cfgname='winlocal')
class_sds = sas.sasdata2dataframe(table='class', libref='sashelp')
class_sds.describe()  

SAS Connection established. Subprocess id is 29804



Unnamed: 0,Age,Height,Weight
count,19.0,19.0,19.0
mean,13.315789,62.336842,100.026316
std,1.492672,5.127075,22.773933
min,11.0,51.3,50.5
25%,12.0,58.25,84.25
50%,13.0,62.8,99.5
75%,14.5,65.9,112.25
max,16.0,72.0,150.0
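The overview also lists loading a Pandas dataframe into a SAS data set, which is the reverse of `sasdata2dataframe()`. SASPy exposes this as `dataframe2sasdata()`, with the short alias `df2sd()`. The sketch below wraps that call in a hypothetical helper, `dataframe_to_sas`, so that the session is passed in explicitly. The table name `class_copy` and the `work` library are assumptions, and the usage lines have not been run against the `winlocal` configuration used in this notebook.

```python
def dataframe_to_sas(session, df, table, libref="work"):
    """Copy a pandas dataframe into a SAS data set and return the resulting handle.

    `session` is expected to be a saspy.SASsession; df2sd is the short
    alias of SASsession.dataframe2sasdata.
    """
    return session.df2sd(df, table=table, libref=libref)


# Usage (requires a working SAS connection, so it is left commented out):
# import saspy
# sas = saspy.SASsession(cfgname='winlocal')
# class_df = sas.sasdata2dataframe(table='class', libref='sashelp')
# class_copy = dataframe_to_sas(sas, class_df, table='class_copy')
# class_copy.head()
```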


### Generating SAS code by using the SASPy module

In [1]:
import saspy
import pandas as pd
sas = saspy.SASsession(cfgname='winlocal')
w_class = sas.sasdata("CARS","SASHELP")
sas.teach_me_SAS(True)  # print the generated SAS code instead of submitting it
w_class.columnInfo()


SAS Connection established. Subprocess id is 26456

proc contents data=SASHELP.CARS ;ods select Variables;run;


In [2]:
import saspy
import pandas as pd
sas = saspy.SASsession(cfgname='winlocal')
%cd C:\Data
p_cars = pd.read_sas('cars.sas7bdat', format='sas7bdat', encoding="utf-8")
p_cars.describe()

SAS Connection established. Subprocess id is 17648

C:\Data


Unnamed: 0,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
count,428.0,428.0,428.0,426.0,428.0,428.0,428.0,428.0,428.0,428.0
mean,32774.85514,30014.700935,3.196729,5.807512,215.885514,20.060748,26.843458,3577.953271,108.154206,186.36215
std,19431.716674,17642.11775,1.108595,1.558443,71.836032,5.238218,5.741201,758.983215,8.311813,14.357991
min,10280.0,9875.0,1.3,3.0,73.0,10.0,12.0,1850.0,89.0,143.0
25%,20334.25,18866.0,2.375,4.0,165.0,17.0,24.0,3104.0,103.0,178.0
50%,27635.0,25294.5,3.0,6.0,210.0,19.0,26.0,3474.5,107.0,187.0
75%,39205.0,35710.25,3.9,6.0,255.0,21.25,29.0,3977.75,112.0,194.0
max,192465.0,173560.0,8.3,12.0,500.0,60.0,66.0,7190.0,144.0,238.0


In [3]:
%%SAS
proc print data=sashelp.class (obs=5); 
run;

Using SAS Config named: winlocal
SAS Connection established. Subprocess id is 23040



Obs,Name,Sex,Age,Height,Weight
1,Alfred,M,14,69.0,112.5
2,Alice,F,13,56.5,84.0
3,Barbara,F,13,65.3,98.0
4,Carol,F,14,62.8,102.5
5,Henry,M,14,63.5,102.5


### Generating a profile of the Python data object created from the SAS data set

In [1]:
import saspy
import pandas_profiling
sas = saspy.SASsession(cfgname='winlocal')
p_cars = sas.sasdata2dataframe(table='cars', libref='sashelp')
pandas_profiling.ProfileReport(p_cars)

SAS Connection established. Subprocess id is 28328





### How to get the current directory name?

In [23]:
import os
print("Path at terminal when executing this file")
print(os.getcwd() + "\n")

Path at terminal when executing this file
C:\Misc



### How to get the current directory name using the Jupyter notebook magic command?

In [24]:
%pwd

'C:\\Misc'
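The same directory can also be obtained as a `pathlib.Path` object, which is convenient when the name or parent of the directory is needed. A short sketch follows; the variable names are illustrative only.

```python
import os
from pathlib import Path

cwd_str = os.getcwd()   # plain string, as printed above
cwd_path = Path.cwd()   # Path object for the same directory

print(cwd_path)
print(cwd_path.name)    # last component only
print(cwd_path.parent)  # containing directory
```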

In [8]:
import os
os.getcwd()
os.chdir("C:/Data")
files = os.listdir(os.curdir)
print(files)

['Complete_output.rtf', 'Create_formats.sas', 'c_deads_2002.sas7bdat', 'Deads_2002.sas', 'Deads_data.pdf', 'formats.sas7bcat', 'FromProcFreq.pdf', 'h171.sas7bdat', 'h181.sas7bdat', 'h192.sas7bdat', 'New_Borns_2002.sas', 'panel6.sas7bdat', 'panel7.sas7bdat', 'Partial_output.rtf', 'report_e.xlsx', 'report_h.html', 'report_p.pdf', 'report_r.rtf', 'sample_data.txt', 'Subdir1', 'Subdir2', 'Traffic_Lighting.html', 'yr_2006_2016.sas7bdat']


In [9]:
import os
os.environ["TEMP"]

'C:\\Users\\pmuhuri\\AppData\\Local\\Temp'

import os
for a in os.environ:
    print(a, os.getenv(a))


import os
for a in os.environ:
    print('Var: ', a, 'Value: ', os.getenv(a))
print("all done")
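Indexing `os.environ` with a name that is not set raises `KeyError`. The sketch below shows the safer `get()` form with a default, plus a sorted listing. The variable name `NO_SUCH_VARIABLE_FOR_DEMO` is made up for the example.

```python
import os

# os.environ["NAME"] raises KeyError when NAME is not set;
# os.environ.get (like os.getenv) returns a default instead.
missing = os.environ.get("NO_SUCH_VARIABLE_FOR_DEMO", "not set")
print(missing)

# Print the variables in sorted order so the listing is easier to scan.
for name in sorted(os.environ):
    print(f"{name}={os.environ[name]}")
```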

### How to list all directories, subdirectories, and files from a specified directory?

The following code renders a horizontal list of entries.

In [15]:
import os
print(os.listdir("C:/Data")) 

['cars.sas7bdat', 'Class_Exercise_Folder.docx', 'Create_cars.sas', 'Data_label.sas', 'Download_ssp.sas', 'Index_cats_Functions.sas', 'Modified_list.sas', 'sascfg.docx', 'SAS_9.4_M6_Installation_Issues.docx', 'SAS_Refs.docx', 'Spring_2020_classroom_booking.docx', 'Subdir1', 'Subdir2', 'TV_Data_noheader.csv']


### How to list all directories, subdirectories, and files from a specified directory?

The following code renders a vertical list of entries.

In [10]:
import os
for p in os.listdir("C:/Data"):
    print(p)

Complete_output.rtf
Create_formats.sas
c_deads_2002.sas7bdat
Deads_2002.sas
Deads_data.pdf
formats.sas7bcat
FromProcFreq.pdf
h171.sas7bdat
h181.sas7bdat
h192.sas7bdat
New_Borns_2002.sas
panel6.sas7bdat
panel7.sas7bdat
Partial_output.rtf
report_e.xlsx
report_h.html
report_p.pdf
report_r.rtf
sample_data.txt
Subdir1
Subdir2
Traffic_Lighting.html
yr_2006_2016.sas7bdat
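Note that `os.listdir()` returns only the immediate entries of a directory; it does not descend into `Subdir1` or `Subdir2`, and it does not say which entries are files. The sketch below uses a throwaway temporary directory (not `C:/Data`) to show this and to tell files from directories with `os.scandir()`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a small tree: one file, and one subdirectory holding a nested file.
    open(os.path.join(root, "a.txt"), "w").close()
    os.mkdir(os.path.join(root, "Subdir1"))
    open(os.path.join(root, "Subdir1", "nested.txt"), "w").close()

    # os.listdir returns only the immediate entries; nested.txt is not included.
    entries = sorted(os.listdir(root))

    # os.scandir yields DirEntry objects that know whether they are files or directories.
    with os.scandir(root) as it:
        kinds = {entry.name: ("dir" if entry.is_dir() else "file") for entry in it}

print(entries)
print(kinds)
```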


### How to list all directories, subdirectories, and files from the current working directory?

The following code provides a vertical list of entries.

In [11]:
import os
os.chdir("C:/Misc")
names = os.listdir()
for p in names:
    print(p)

### How to list all files from a specified directory but no directories or subdirectories? (Efficient method)

In [18]:
import glob
path = 'C:\\Data\\'
files = glob.iglob(path + '**/*.sas', recursive=True)  # lazy iterator over matching paths
for f in files:
    print(f)

C:\Data\Create_cars.sas
C:\Data\Data_label.sas
C:\Data\Download_ssp.sas
C:\Data\Index_cats_Functions.sas
C:\Data\Modified_list.sas


### How to list all files from a specified directory but no directories or subdirectories?

In [12]:
import os
path = 'C:\\Data'
files = []
for r, d, f in os.walk(path):
    for file in f:
        # Substring test: also matches .sas7bdat and .sas7bcat, as the output below shows
        if '.sas' in file:
            files.append(os.path.join(r, file))
            
for f in files:
    print(f)

C:\Data\Create_formats.sas
C:\Data\c_deads_2002.sas7bdat
C:\Data\Deads_2002.sas
C:\Data\formats.sas7bcat
C:\Data\h171.sas7bdat
C:\Data\h181.sas7bdat
C:\Data\h192.sas7bdat
C:\Data\New_Borns_2002.sas
C:\Data\panel6.sas7bdat
C:\Data\panel7.sas7bdat
C:\Data\yr_2006_2016.sas7bdat
C:\Data\Subdir2\fclass.sas7bdat
C:\Data\Subdir2\h105.sas7bdat
C:\Data\Subdir2\h113.sas7bdat
C:\Data\Subdir2\h121.sas7bdat
C:\Data\Subdir2\h129.sas7bdat
C:\Data\Subdir2\h138.sas7bdat
C:\Data\Subdir2\h147.sas7bdat
C:\Data\Subdir2\h155.sas7bdat
C:\Data\Subdir2\h163.sas7bdat
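The listing above includes `.sas7bdat` and `.sas7bcat` files because `'.sas' in file` is a substring test. If only SAS program files are wanted, comparing the extension is one option. The sketch below contrasts the two tests on a temporary directory; the file names are borrowed from the listing, and the files themselves are empty placeholders.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    for name in ("Create_formats.sas", "h171.sas7bdat", "formats.sas7bcat", "notes.txt"):
        open(os.path.join(root, name), "w").close()

    substring_hits, suffix_hits = [], []
    for r, d, f in os.walk(root):
        for file in f:
            # Substring test, as in the cell above.
            if '.sas' in file:
                substring_hits.append(file)
            # Extension test: keeps only names whose extension is exactly .sas.
            if os.path.splitext(file)[1].lower() == '.sas':
                suffix_hits.append(file)

print(sorted(substring_hits))
print(sorted(suffix_hits))
```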


### How to list all directories and subdirectories from a specified directory but no files?

In [13]:
import os
path = 'C:\\Data'
folders = []
for r, d, f in os.walk(path):
    for folder in d:
        folders.append(os.path.join(r, folder))
            
for f in folders:
    print(f)

C:\Data\Subdir1
C:\Data\Subdir2
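A `pathlib` version of the same idea is sketched below on a temporary tree. The folder names `Subdir1`, `Subdir2`, and `Deep` are created only for the example and do not reflect the real contents of `C:\Data`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Subdir1" / "Deep").mkdir(parents=True)
    (root / "Subdir2").mkdir()
    (root / "file.sas").touch()

    # rglob('*') walks the whole tree; is_dir() keeps directories only.
    folders = sorted(
        p.relative_to(root).as_posix() for p in root.rglob('*') if p.is_dir()
    )

print(folders)
```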


In [14]:
import pandas as pd
from pathlib import Path
import time

p = Path("C:/Data")
all_files = []
for i in p.rglob('*.SAS'):
    all_files.append((i.name, i.parent, time.ctime(i.stat().st_ctime)))

columns = ["File_Name","Parent", "Created"]
df = pd.DataFrame.from_records(all_files, columns=columns)
print(df.to_string(index=False))


File_Name   Parent                   Created
Create_formats.sas  C:\Data  Mon Mar  4 21:09:48 2019
    Deads_2002.sas  C:\Data  Mon Mar  4 21:00:54 2019
New_Borns_2002.sas  C:\Data  Mon Mar  4 21:00:54 2019


In [15]:
import pandas as pd
from pathlib import Path
import time

p = Path("C:/Data")
all_files = []
for i in p.rglob('*.SAS'):
    all_files.append((i.name, time.ctime(i.stat().st_ctime)))

columns = ["File_Name", "Created"]
df = pd.DataFrame.from_records(all_files, columns=columns)
print(df.to_string(index=False))

File_Name                   Created
Create_formats.sas  Mon Mar  4 21:09:48 2019
    Deads_2002.sas  Mon Mar  4 21:00:54 2019
New_Borns_2002.sas  Mon Mar  4 21:00:54 2019
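One caveat on the "Created" column: `st_ctime` is the creation time on Windows, but on most Unix systems it is the time of the last metadata change. The sketch below is a variation that records the modification time (`st_mtime`) and file size instead, and converts the timestamp to a datetime column. It runs on placeholder files in a temporary directory, not on `C:/Data`.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("Create_formats.sas", "Deads_2002.sas"):
        (root / name).write_text("/* placeholder */")

    # Collect (name, size in bytes, modification time in epoch seconds) per file.
    records = [
        (p.name, p.stat().st_size, p.stat().st_mtime)
        for p in sorted(root.rglob('*.sas'))
    ]

df = pd.DataFrame.from_records(records, columns=["File_Name", "Bytes", "Modified"])
# Convert epoch seconds to timestamps so the column can be sorted and filtered.
df["Modified"] = pd.to_datetime(df["Modified"], unit="s")
print(df.to_string(index=False))
```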


[How to select rows and columns in Pandas using [ ], .loc, .iloc, .at and .iat](https://www.kdnuggets.com/2019/06/select-rows-columns-pandas.html)

In [16]:
import pandas as pd
from pathlib import Path
import time

p = Path("C:/Data")
all_files = []
for i in p.rglob('*.SAS'):
    all_files.append((i.name, time.ctime(i.stat().st_ctime)))

columns = ["File_Name", "Created"]
df = pd.DataFrame.from_records(all_files, columns=columns)
xdf=df.iloc[:,[0]]
print(xdf.to_string(index=False))

File_Name
Create_formats.sas
    Deads_2002.sas
New_Borns_2002.sas
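For comparison, the sketch below applies several of the selectors named in the linked article to a small dataframe built from the values printed above. It is a hand-made frame, not the result of scanning `C:/Data`.

```python
import pandas as pd

df = pd.DataFrame({
    "File_Name": ["Create_formats.sas", "Deads_2002.sas", "New_Borns_2002.sas"],
    "Created": [
        "Mon Mar  4 21:09:48 2019",
        "Mon Mar  4 21:00:54 2019",
        "Mon Mar  4 21:00:54 2019",
    ],
})

by_position = df.iloc[:, [0]]        # dataframe: first column, selected by position
by_label = df.loc[:, ["File_Name"]]  # dataframe: same column, selected by label
as_series = df["File_Name"]          # Series: single brackets with one label
one_cell = df.at[1, "File_Name"]     # scalar: row label 1, column "File_Name"

print(by_position.equals(by_label))
print(type(as_series).__name__, one_cell)
```

Passing a list (`[0]` or `["File_Name"]`) keeps the result a dataframe, which is why `to_string(index=False)` in the cell above still prints a header.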


In [17]:
from pathlib import Path
data_dir = Path('C:/Data')  # avoid shadowing the built-in dir()
files = data_dir.glob('*.sas')
for i in files:
    print(i)

C:\Data\Create_formats.sas
C:\Data\Deads_2002.sas
C:\Data\New_Borns_2002.sas


In [21]:
import glob
for name in glob.glob("C:/Data/*.sas"):
    print(name)

C:/Data\Create_formats.sas
C:/Data\Deads_2002.sas
C:/Data\New_Borns_2002.sas


In [22]:
from pathlib import Path
my_file = Path("/Data/Create_formats.sas")
my_file.is_file() 

True
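`is_file()` returns `False` both for a missing path and for a directory, so it helps to know the related checks. The sketch below compares them on a temporary folder; the file names are placeholders.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    existing = folder / "Create_formats.sas"
    existing.touch()
    missing = folder / "no_such_file.sas"

    checks = {
        "file is_file": existing.is_file(),
        "folder is_file": folder.is_file(),
        "folder is_dir": folder.is_dir(),
        "missing exists": missing.exists(),
    }

print(checks)
```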

In [26]:
from saspy import autocfg
autocfg.main()

CFGFILE ALREADY EXISTS: C:\Users\pmuhuri\AppData\Local\Continuum\anaconda3\lib\site-packages\saspy\sascfg_personal.py


In [28]:
# Raw string keeps the backslashes literal; "with" closes the file automatically
with open(r'C:\Data\Deads_2002.sas') as fd:
    print(fd.read())


LIBNAME library  'C:\Data';
LIBNAME new  'C:\Data';
data Deads_2002;
  set new.panel6 (in=a) new.panel7 (in=b);
  array pstats(3) pstats31 pstats42 pstats53;
  
 if 23 in pstats then found_dead=1;
 else if 24 in pstats then found_dead=1;
 else if 31 in pstats then found_dead=1;
 else found_dead=0;  
 
if a=1 then panel=6; else panel=7;
if found_dead=1  then output;
 run;
 data new.c_Deads_2002; 
   set Deads_2002;
   cum_count+count;
run;

ods pdf file='C:\Data\Deads_data.pdf' ;
options nocenter nodate nonumber ls=132 leftmargin=.1in rightmargin=1in
options nocenter ls=132;
title 'Insope status of Deads in MEPS 2002';
proc print data=new.c_Deads_2002 noobs;
var Panel 
inscope INSC1231 inscop02 begrfy endrfy inscop31 pstats31 inscop42 pstats42
        inscop53 pstats53 cum_count ;
run;
ods pdf close;




### How to check the pandas version

In [32]:
pd.__version__

'0.24.2'

### Use the show_versions() function to list the versions of pandas' dependencies

In [30]:
import pandas as pd
pd.show_versions()


INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.4
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


In [31]:
from platform import python_version
print(python_version())


3.7.1
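To look up the version of a single package without importing it, the standard library's `importlib.metadata` can be used. The sketch below assumes Python 3.8 or later, which is newer than the 3.7.1 interpreter shown above; `installed_version` is a helper name introduced here.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(sys.version_info[:3])
for name in ("pandas", "saspy", "numpy"):
    print(name, installed_version(name))
```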


In [None]:
import sys
print(sys.version)  # full interpreter version and build information
help('modules')     # list every importable module (can take a while)

In [2]:
!jupyter kernelspec list

Available kernels:
  662a849e-529c-49bd-8309-bde2acd05cb2    C:\Users\pmuhuri\AppData\Roaming\jupyter\kernels\662a849e-529c-49bd-8309-bde2acd05cb2
  ir                                      C:\Users\pmuhuri\AppData\Roaming\jupyter\kernels\ir
  sas                                     C:\Users\pmuhuri\AppData\Roaming\jupyter\kernels\sas
  python3                                 C:\Users\pmuhuri\AppData\Local\Continuum\anaconda3\share\jupyter\kernels\python3


In [2]:
!python --version

Python 3.7.1


In [3]:
!jupyter --version

4.4.0


In [43]:
!R --version

R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

