Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pd.read_fwf ignoring a space when reading space delimited file #23455

Closed
paulvija opened this issue Nov 2, 2018 · 6 comments
Closed

pd.read_fwf ignoring a space when reading space delimited file #23455

paulvija opened this issue Nov 2, 2018 · 6 comments
Labels
IO Data IO issues that don't fit into a more specific label

Comments

@paulvija
Copy link

paulvija commented Nov 2, 2018

Code Sample

temperature_lst=
   41485    177.40357149    -39.01654000   13.000    13.92  .0100  .2000    -1.02  .0100  .2000          
   41489    171.56518149    -42.94461000  743.000     7.29  .0100  .3333      .51  .0100  .3333        
df = pd.read_fwf('./temperature_lst',names=['agn','longt','lat','v1','v2','v3', 'v4','v5','v6', 'v7'])

df.tail()
agn       longt       lat     v1     v2    v3      v4    v5    v6      v7
41485  177.403571 -39.01654   13.0  13.92  0.01  0.2000  1.02  0.01  0.2000
41489  171.565181 -42.94461  743.0   7.29  0.01  0.3333  0.51  0.01  0.3333

Problem description

temperature_lst is a space delimited file, when using read_fwf to read this in to a dataframe some space is truncated in column 'v5'. This results in '-1.02' becoming just '1.02'.
Is this is a bug?

However, this works when 'widths' is defined. In my case, there are multiple files that are read in this way, so using 'widths' can be tedious.

 df = pd.read_fwf('./11068_temperature_anom.lst',names=['agn','longt','lat','v1','v2','v3', 'v4','v5',
    ...: 'v6', 'v7'], widths=[8,16,16,9,9,7,7,9,7,7])
@dsaxton
Copy link
Member

dsaxton commented Nov 2, 2018

Can you rework your example so that it's valid code? (In any case if I try creating a sample file with the contents of temperature_lst I seem to get a correct result, so it may be that the file you're referencing is malformed.)

@paulvija paulvija closed this as completed Nov 2, 2018
@paulvija paulvija reopened this Nov 2, 2018
@paulvija
Copy link
Author

paulvija commented Nov 2, 2018

Sorry, my bad tapped on close button by mistake.
I will attach temperature_lst soon for your reference.

@gfyoung gfyoung added IO Data IO issues that don't fit into a more specific label Needs Info Clarification about behavior needed to assess issue labels Nov 2, 2018
@TomAugspurger
Copy link
Contributor

See http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports for writing examples. You should be able to use StringIO to make an example, no need for a separate text file.

@paulvija
Copy link
Author

paulvija commented Nov 4, 2018

Thanks for that guys.
Sorry - the datafile is reasonably huge. This is a selection of temperature_lst, if I trim this short again then the problem goes away. Also, found out that if I skip the first row, then the last row (7th col) value '-1.05' remains the same and no problem. So I am wondering what is going on.

data = """   11104    168.31780110    -46.41104000    1.000     9.94  .0100  .0455      .28  .0100  .0455          
    2006    174.86384201    -37.20637000   88.000    13.55  .0100  .0476     -.24  .0100  .0476          
    2006    174.86384201    -37.20637000   88.000    13.55  .0100  .0476     -.24  .0100  .0476          
    2112    175.33175211    -37.86088000   53.000    13.03  .0100  .0400     -.38  .0100  .0400          
    2136    174.70700214    -38.17100000   27.000    14.13  .0103  .0556     -.45  .0103  .0556          
    2283    174.18382228    -39.00790000   30.000    12.75  .0100  .0400     -.33  .0100  .0400          
    2592    176.21180259    -40.90417000  120.000    13.03  .0103  .0385      .24  .0103  .0385          
    2685    175.23300269    -41.58900000    6.000    13.77  .0103  .0625      .31  .0103  .0625          
    2692    178.31400269    -37.56400000   46.000    13.84  .0100  .0417      .36  .0100  .0417          
    2810    177.98510281    -38.65860000    5.000    13.95  .0100  .0357     -.26  .0100  .0357          
    2980    176.86459298    -39.47006000    3.000    13.61  .0100  .0400     -.07  .0100  .0400          
    3017    176.84100302    -39.64800000   16.000    13.03  .0100  .0625      .55  .0100  .0625          
    3126    177.41300313    -39.01700000   15.000    13.92  .0100  .0667     -.49  .0100  .0667          
    3142    177.96180314    -39.11738000  136.000    13.34  .0100  .0435      .26  .0100  .0435          
    3145    174.98400315    -40.90700000    5.000    12.44  .0100  .0154      .30  .0100  .0154          
    3243    175.61459324    -40.31858000   40.000    12.45  .0100  .0417      .28  .0100  .0417          
    3275    175.25700328    -40.62200000   14.000    12.53  .0100  .0385      .45  .0100  .0385          
    3445    174.80400345    -41.32200000    4.000    12.95  .0100  .0185      .34  .0100  .0185          
    3715    175.04496372    -39.93882000   15.000    13.35  .0100  .0135     -.40  .0100  .0135          
    3719    175.02400372    -39.96200000    8.000    13.03  .0100  .0345     -.12  .0100  .0345          
    3798    173.00000380    -40.54700000    3.000    13.16  .0100  .0357      .92  .0100  .0357          
    3909    170.98428391    -42.71228000   39.000    11.15  .0100  .0185      .42  .0100  .0185          
    3910    170.98428391    -42.71228000   38.000    11.15  .0100  .0370      .33  .0100  .0370          
    3925    171.86014393    -42.11578000  198.000    11.29  .0100  .0250     1.09  .0100  .0250          
    4097    169.00700410    -43.86100000    5.000    10.37  .0106  .0500      .60  .0106  .0500          
    4107    167.92406411    -44.67384000    3.000    10.14  .0100  .0167      .54  .0100  .0167          
    4141    166.61299414    -46.15620000   44.000    10.14  .0100  .0588      .16  .0100  .0588          
    4241    173.22600424    -41.29900000    2.000    12.44  .0100  .0143      .50  .0100  .0143          
    4271    173.21900427    -41.30200000    4.000    12.56  .0100  .0417      .32  .0100  .0417          
    4326    173.86439433    -41.52133000   35.000    12.34  .0100  .0400      .49  .0100  .0400          
    4395    174.44168440    -41.10320000   68.000    12.55  .0100  .0667      .11  .0100  .0667          
    4424    174.27620442    -41.72650000    2.000    12.26  .0100  .0667      .01  .0100  .0667          
    4506    173.69600451    -42.42000000  105.000    11.74  .0100  .0385      .10  .0100  .0385          
    4843    172.53700484    -43.49300000   37.000    11.46  .0100  .0159     -.30  .0100  .0159          
    4960    173.11900496    -43.74600000  236.000    10.55  .0100  .0500     -.16  .0100  .0500          
    5086    171.22141509    -44.30517000   26.000    10.46  .0100  .0385     -.28  .0100  .0385          
    5142    171.08100514    -44.97300000   30.000    10.36  .0100  .0588      .11  .0100  .0588          
    5212    169.88948521    -44.52612000  488.000     9.87  .0100  .0526      .69  .0100  .0526          
    5430    167.64166543    -45.53283000  209.000     9.33  .0100  .0500     -.23  .0100  .0500          
    5451    168.74005545    -45.01758000  354.000     9.85  .0100  .0417      .09  .0100  .0417          
    5496    168.44900550    -45.74800000  187.000     9.83  .0100  .0303      .24  .0100  .0303          
    5535    169.68419554    -45.04010000  375.000    10.06  .0100  .0417      .45  .0100  .0417          
    5778    168.88700578    -46.11500000  123.000     9.94  .0100  .0417      .40  .0100  .0417          
    5814    168.33052581    -46.41741000    1.000     9.95  .0100  .0149      .23  .0100  .0149          
    5823    168.37600582    -46.58700000    5.000    10.35  .0103  .0250      .33  .0103  .0250          
    5893    169.81000589    -46.44900000  129.000     9.93  .0100  .0345      .29  .0100  .0345          
    5909    167.46290591    -47.27675000  101.000     9.63  .0100  .0556      .46  .0100  .0556          
    7339    170.19684734    -45.92675000    1.000    10.37  .0100  .0385      .43  .0100  .0385          
    7342    171.57963734    -41.74055000    2.000    11.85  .0100  .0455      .27  .0100  .0455          
    7426    169.24400743    -44.72200000  352.000    10.59  .0100  .0435      .68  .0100  .0435          
    8567    174.98400857    -40.90700000    5.000    12.43  .0100  .0435      .34  .0100  .0435          
    9533    166.88600953    -45.22100000   19.000    10.93  .0119  .0556      .28  .0119  .0556          
    9654    175.11540965    -35.90290000   60.000    14.95  .0111  .0667      .59  .0111  .0667          
   11104    168.31780110    -46.41104000    1.000     9.94  .0100  .0455      .28  .0100  .0455          
   11234    172.85099123    -42.53433000  363.000    10.24  .0100  .0500      .05  .0100  .0500          
   12428    176.32401243    -37.82200000   91.000    13.34  .0100  .0588      .35  .0100  .0588          
   12429    172.97166243    -41.09798000    8.000    12.25  .0100  .0476      .90  .0100  .0476          
   12430    173.96286243    -41.49891000    4.000    12.94  .0100  .0476      .78  .0100  .0476          
   12442    174.98438244    -40.90392000    5.000    12.43  .0100 1.0000      .08  .0100 1.0000          
   12444    168.33046244    -46.41727000     .000     9.95  .0100 1.0000      .07  .0100 1.0000          
   12482    167.27501248    -45.52500000  178.000     8.95  .0100  .0667      .17  .0100  .0667          
   15752    170.51471575    -45.90129000    4.000    10.95  .0100  .0500      .59  .0100  .0500          
   15876    176.91149588    -39.60698000    5.000    12.73  .0100  .0588     -.31  .0100  .0588          
   17030    175.73497703    -37.87683000   85.000    12.64  .0100  .0714     -.12  .0100  .0714          
   17067    173.26201707    -35.13500000   85.000    14.45  .0100  .0667      .18  .0100  .0667          
   17244    172.61115724    -43.32858000   23.000    11.33  .0100  .0588     -.40  .0100  .0588          
   17603    172.47041760    -43.62622000   18.000    11.53  .0100  .0556     -.10  .0100  .0556          
   17838    174.66767784    -36.43435000   72.000    13.56  .0100  .0556     -.75  .0100  .0556          
   18125    170.09601813    -43.73600000  730.000     8.70  .0100  .0667     1.06  .0100  .0667          
   18183    173.28741818    -35.06770000   80.000    14.54  .0100  .0625      .21  .0100  .0625          
   18234    174.86701823    -41.40700000   79.000    12.22  .0100  .2000     -.07  .0100  .2000          
   18309    167.92301831    -44.67700000    3.000    10.45  .0100  .1000      .28  .0100  .1000          
   18437    170.13562844    -45.51814000  213.000    10.18  .0100  .0588      .44  .0100  .0588          
   18464    175.54510846    -39.19591000 1097.000     6.48  .0100  .0625     -.32  .0100  .0625          
   18593    170.10046859    -45.12427000  450.000     9.18  .0100  .0625      .58  .0100  .0625          
   18594    170.82282859    -45.00829000   81.000    10.30  .0100  .0769     -.30  .0100  .0769          
   21938    175.38987194    -41.25231000   20.000    12.17  .0100  .0769      .55  .0100  .0769          
   21963    175.60917196    -40.38195000   21.000    12.55  .0100  .0833      .28  .0100  .0833          
   22719    174.77644272    -36.96177000    5.000    14.45  .0100  .0714     -.08  .0100  .0714          
   23849    172.80570385    -40.86364000   20.000    12.23  .0100  .0714     -.19  .0100  .0714          
   23899    175.15342390    -38.33356000   62.000    13.05  .0100  .0769     -.25  .0100  .0769          
   23908    175.58530391    -37.71958000   48.000    13.52  .0100  .0833     -.96  .0100  .0833          
   23934    171.19159393    -42.46022000    5.000    11.64  .0100  .1111     -.24  .0100  .1111          
   23976    174.62402398    -36.79300000   26.000    13.76  .0100  .0833     -.55  .0100  .0833          
   24120    172.60771412    -43.53074000    6.000    12.08  .0100  .0714      .14  .0100  .0714          
   24926    170.13430493    -43.36548000   80.000    10.49  .0100  .0714      .26  .0100  .0714          
   24945    170.44321495    -44.00173000  762.000     8.88  .0100  .0714      .45  .0100  .0714          
   24976    177.92182498    -38.62747000   12.000    13.95  .0100  .2500      .44  .0100  .2500          
   25119    173.85319512    -35.93145000   66.000    14.32  .0103  .0714     -.01  .0103  .0714          
   25222    174.29172522    -39.61170000   98.000    11.90  .0100  .0769     -.28  .0100  .0769          
   25354    174.76802535    -41.28500000  125.000    12.04  .0100  .0769      .55  .0100  .0769          
   25531    174.78006553    -41.08675000  100.000    12.85  .0100  .1250     -.97  .0100  .1250          
   25643    175.79082564    -38.97532000  350.000    11.35  .0100  .1000     -.18  .0100  .1000          
   25777    172.15570578    -41.27058000   18.000    12.05  .0100  .1250     -.04  .0100  .1250          
   25820    176.26802582    -40.04300000  330.000    11.44  .0100  .0833     -.05  .0100  .0833          
   25821    171.56287582    -42.94152000  745.000     7.11  .0100  .1000     1.03  .0100  .1000          
   25937    170.95002594    -45.10000000   40.000    10.86  .0100  .0909     -.33  .0100  .0909          
   26104    171.45630610    -42.68168000  105.000    11.64  .0106  .1667      .32  .0106  .1667          
   26117    175.30508612    -37.77567000   45.000    13.24  .0100  .0909     -.07  .0100  .0909          
   26163    169.73152616    -46.29282000   11.000    10.54  .0100  .1111      .19  .0100  .1111          
   26170    171.80502617    -43.90500000   89.000    11.75  .0100  .0909    -1.05  .0100  .0909          """ 

df = pd.read_fwf(StringIO(data), header=None)
df.tail(2)

         0           1         2     3      4     5       6     7     8       9
99   26163  169.731526 -46.29282  11.0  10.54  0.01  0.1111  0.19  0.01  0.1111
100  26170  171.805026 -43.90500  89.0  11.75  0.01  0.0909  1.05  0.01  0.0909

@thoo
Copy link
Contributor

thoo commented Nov 20, 2018

@paulvija The colspec is determined by the first 100 rows. So in your case, the width of the 7th col is only 4 up to 100.

Also, found out that if I skip the first row, then the last row (7th col) value '-1.05' remains the same and no problem.

This happens because the last row is now within 100. If you know your width, you can assign like +0.28 in the first line or any line between 0 to 99 index .
When you use the widths, it passes values to colspec. So in that case, you got what you expected.
We just clean up docstrings for read.fwf and hopefully it would be more clear in the future.

colspecs : list of tuple (int, int) or 'infer'. optional
        A list of tuples giving the extents of the fixed-width
        fields of each line as half-open intervals (i.e.,  [from, to[ ).
        String value 'infer' can be used to instruct the parser to try
        detecting the column specifications from the first 100 rows of
        the data which are not being skipped via skiprows (default='infer').

@TomAugspurger
Copy link
Contributor

Closing, let us know if you think there's still something to be done @paulvija.

@TomAugspurger TomAugspurger added this to the No action milestone Nov 20, 2018
@TomAugspurger TomAugspurger removed the Needs Info Clarification about behavior needed to assess issue label Nov 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

No branches or pull requests

5 participants